# Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques

### Coordinators

• Kota Dohi (Hitachi, Ltd.)
• Keisuke Imoto (Doshisha University)
• Yuma Koizumi (Google, Inc.)
• Noboru Harada
• Daisuke Niizumi
• Tomoya Nishida (Hitachi, Ltd.)
• Harsh Purohit (Hitachi, Ltd.)
• Takashi Endo (Hitachi, Ltd.)
• Masaaki Yamamoto (Hitachi, Ltd.)
• Yohei Kawaguchi (Hitachi, Ltd.)

The challenge has ended. Full results for this task can be found on the results page.

# Description

Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial intelligence (AI)-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines. Figure 1 shows an overview of the detection system.

This task is the follow-up to DCASE 2020 Task 2 and DCASE 2021 Task 2. The task this year is to detect anomalous sounds under three main conditions:

1. Only normal sound clips are provided as training data (i.e., unsupervised learning scenario). In real-world factories, anomalies rarely occur and are highly diverse. Therefore, exhaustive patterns of anomalous sounds are impossible to create or collect and unknown anomalous sounds that were not observed in the given training data must be detected. This condition is the same as in DCASE 2020 Task 2 and DCASE 2021 Task 2.

2. Factors other than anomalies change the acoustic characteristics between training and test data (i.e., domain shift). In real-world cases, operational conditions of machines or environmental noise often differ between the training and testing phases. For example, the operation speed of a conveyor can change due to seasonal demand, or environmental noise can fluctuate depending on the states of surrounding machines. This condition is the same as in DCASE 2021 Task 2.

3. In test data, samples unaffected by domain shifts (source domain data) and those affected by domain shifts (target domain data) are mixed, and the source/target domain of each sample is not specified. Therefore, the model must detect anomalies with the same threshold value regardless of the domain (i.e., domain generalization).

# Domain generalization: Focus of task

Domain shifts are differences in acoustic characteristics between the training and test data, such as differences in operational speed, machine load, and environmental noise. Because these shifts are caused by factors other than anomalies, models trained with the source domain data can easily produce false positives on shifted data. One solution is to track these shifts and use domain adaptation techniques to adapt the model using the target domain data. However, in real-world cases, domain adaptation can be costly or even impractical, so domain generalization techniques are preferred. Domain generalization techniques mainly use the source domain data to learn features that are common across domains so that the model generalizes to both the source and target domains in the test data. The training data contain few to no samples from the target domain. Domain generalization techniques are preferred over domain adaptation techniques mainly in the following four scenarios:

1. Domain shifts caused by differences in a machine’s physical parameters
Characteristics of a machine sound can change due to changes in the machine’s physical parameters. Although these shifts can be tracked, if a parameter changes within a short period of time, it can be too costly to adapt the model every time the value changes.

2. Domain shifts caused by differences in environmental conditions
Because characteristics of background noise can be affected by various factors, it is difficult to track these shifts. Therefore, a model that is unaffected by these shifts is desirable.

3. Domain shifts caused by maintenance
Characteristics of a machine sound can change after maintenance or parts replacement. Though these shifts can be tracked, adapting the model every time can be costly.

4. Domain shifts caused by differences in recording devices
In real-world scenarios, many microphones are installed at different locations, and these microphones may be from different manufacturers. Although these shifts can be tracked, adapting the model for each location or microphone can be too costly.

Each section of the dataset this year was made to reflect one of these scenarios.

## Schedule

Based on the DCASE Challenge 2022 schedule, the important dates for this task are as follows.

• Task open: 15th of March 2022
• Additional training dataset release: 15th of April 2022
• Evaluation dataset release: 1st of June 2022
• External resource list lock: 1st of June 2022
• Challenge deadline: 15th of June 2022
• Challenge results: 1st of July 2022

External resources on the "List of external datasets and models allowed" can be used (cf. external data resource section). The list will be updated upon request. Any external resource that was freely accessible before 15th of April 2022 can be added. Please send a request email to the task organizers. The list will be locked after the release date of the evaluation dataset (1st of June 2022). To prevent new external resources from being developed using machine information in the evaluation dataset, we will release the additional training dataset after 15th of April 2022. Note that the additional training dataset contains matching training data of machines used in the evaluation dataset (cf. dataset section).

# Audio datasets

## Dataset overview

The data used for this task consists of the normal/anomalous operating sounds of seven types of machines. Each recording is a single-channel, 10-second audio clip that includes both the sounds of the target machine and environmental sounds. The following seven types of machines are used in this task:

• Fan
• Gearbox
• Bearing
• Slide rail
• Toy car
• Toy train
• Valve

Figure 2 shows an overview of the datasets for development, additional training, and evaluation. Each dataset consists of the seven types of machines, and each type of machine consists of three “sections” (Sections 00, 01, and 02 for the development dataset and Sections 03, 04, and 05 for the additional training dataset and the evaluation dataset).

## Definition

We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

• "Machine type" indicates the kind of machine, which in this task is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.
• A section is defined as a subset of the dataset for calculating performance metrics. Each section is dedicated to a specific type of domain shift.
• The source domain is the domain under which most of the training data and part of the test data were recorded, and the target domain is a different set of domains under which a few of the training data and part of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, SNR, etc.
• Attributes are parameters that define states of machines or types of noise.

## Development, additional training, and evaluation datasets

Our entire dataset consists of three datasets:

1. Development dataset: This dataset consists of three sections for each machine type (Sections 00, 01, and 02), and each section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

2. Additional training dataset: This dataset provides the other three sections for each machine type (Sections 03, 04, and 05). Each section consists of (i) 990 clips of normal sounds in the source domain for training and (ii) ten clips of normal sounds in a target domain for training. The domain and attributes of each sample are provided. Participants may also use this dataset for training. The additional training dataset will be open on April 15th.

3. Evaluation dataset: This dataset provides test clips for the three sections identical to the additional training dataset (Sections 03, 04, and 05). Each section has 200 test clips, none of which have a condition label (i.e., normal or anomaly) or information about the domain to which it belongs (i.e., source or target). Attributes are not provided. Note that the sections of the evaluation dataset (Sections 03, 04, and 05) are the same as those of the additional training dataset but different from those of the development dataset (Sections 00, 01, and 02).

As described above, the main modification from the 2021 version is that the task is a domain generalization task and the source/target domain of test data in the evaluation dataset is not provided. Most of the training data (990 clips for each section) are from a domain called the source domain, and only a few normal clips (ten clips for each section) are provided for another domain called the target domain. Because the domain of test data in the evaluation dataset is not specified, the participants are encouraged to develop a system that can detect anomalies with the same threshold value for all domains.

## File names and attribute csv files

File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by their respective file names. Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, name of parameters that cause domain shifts (domain shift parameter, dp), and the value or type of these parameters (domain shift value, dv) are listed. Each row takes the following format:

[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
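As a hedged illustration of the row format above, the following sketch splits one attribute row into a file name and a parameter-to-value mapping. The sample row, the file name, and the helper `parse_attribute_row` are hypothetical; only the alternating parameter/value layout is taken from the description.

```python
import csv
import io


def parse_attribute_row(row):
    """Split [filename, d1p, d1v, d2p, d2v, ...] into (filename, {dp: dv})."""
    filename, rest = row[0], row[1:]
    # Parameters sit at even offsets, their values at the following odd offsets.
    attributes = {p: v for p, v in zip(rest[0::2], rest[1::2]) if p != ""}
    return filename, attributes


# Hypothetical sample row; real file names and attribute values will differ.
sample = "example_clip_0001.wav,vel,6,loc,A\n"
rows = list(csv.reader(io.StringIO(sample)))
filename, attributes = parse_attribute_row(rows[0])
print(filename, attributes)
```

Values are kept as strings here; converting them to `int` or `float` where appropriate is left to the caller.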


## Recording procedure

Normal/anomalous operating sounds of machines and their related equipment were recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

## Short description of each section in the development dataset

Short descriptions of each section in the development dataset and the attribute format in the file names of their training data are as follows:

| Machine type | Section | Description | Attribute format in file names of training data |
|---|---|---|---|
| Fan | 00 | Mixing of different machine sound between domains. | `m-n(machine-noise)_` |
| Fan | 01 | Mixing of different factory noise between domains. | `f-n(factory-noise)_` |
| Fan | 02 | Different levels of noise between domains. | `n-lv(noise-level)_` |
| Gearbox | 00 | Different operation voltage between domains. | `volt(voltage)_` |
| Gearbox | 01 | Different weight attached to the box between domains. | `wt(weight)_` |
| Gearbox | 02 | Different gearbox ID between domains. | `id(machine ID)_` |
| Bearing | 00 | Different rotation velocity between domains. | `vel(velocity)_` |
| Bearing | 01 | Different microphone location between domains. | `vel(velocity)__loc(location of the microphone)` |
| Bearing | 02 | Mixing of different factory noise between domains. | `vel(velocity)__f-n(factory-noise)` |
| Slide rail | 00 | Different operation velocity between domains. | `vel(velocity)_` |
| Slide rail | 01 | Different acceleration between domains. | `ac(acceleration)_` |
| Slide rail | 02 | Mixing of different factory noise between domains. | `f-n(factory-noise)_` |
| ToyCar | 00 | Car model and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyCar | 01 | Speed level and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyCar | 02 | Different microphone types, positions, and noise between domains. | `car__spd(speed)__mic__noise_` |
| ToyTrain | 00 | Train model and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyTrain | 01 | Speed level and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyTrain | 02 | Different microphone types, positions, and noise between domains. | `car__spd(speed)__mic__noise_` |
| Valve | 00 | Open/close operation patterns vary between domains. | `pat(pattern)_` |
| Valve | 01 | Number and location of panels around the valve vary between domains. Open means that there are no panels. | `pat(pattern)__panel_` |
| Valve | 02 | Multiple valves are running simultaneously in the target domain. The anomaly condition means that a small piece of paper is caught in one or more of the valves. v1 and v2 refer to the valve index. | Source domain: `v1pat(valve1 pattern)_` or `v2pat(valve2 pattern)_`. Target domain: `v1pat__v2pat_` |

## Short description of each section in the additional training dataset

Short descriptions of each section in the additional training dataset and the attribute format in the file names of their training data are as follows:

| Machine type | Section | Description | Attribute format in file names of additional training data |
|---|---|---|---|
| Fan | 03 | Mixing of different machine sound between domains. | `m-n(machine-noise)_` |
| Fan | 04 | Mixing of different factory noise between domains. | `f-n(factory-noise)_` |
| Fan | 05 | Different levels of noise between domains. | `n-lv(noise-level)_` |
| Gearbox | 03 | Different operation voltage between domains. | `volt(voltage)_` |
| Gearbox | 04 | Different weight attached to the box between domains. | `wt(weight)_` |
| Gearbox | 05 | Different gearbox ID between domains. | `id(machine ID)_` |
| Bearing | 03 | Different rotation velocity between domains. | `vel(velocity)_` |
| Bearing | 04 | Different microphone location between domains. | `vel(velocity)__loc(location of the microphone)` |
| Bearing | 05 | Mixing of different factory noise between domains. | `vel(velocity)__f-n(factory-noise)` |
| Slide rail | 03 | Different operation velocity between domains. | `vel(velocity)_` |
| Slide rail | 04 | Different acceleration between domains. | `ac(acceleration)_` |
| Slide rail | 05 | Mixing of different factory noise between domains. | `f-n(factory-noise)_` |
| ToyCar | 03 | Car model and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyCar | 04 | Speed level and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyCar | 05 | Different microphone types, positions, and noise between domains. | `car__spd(speed)__mic__noise_` |
| ToyTrain | 03 | Train model and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyTrain | 04 | Speed level and noise variations between domains. | `car__spd(speed)__mic__noise_` |
| ToyTrain | 05 | Different microphone types, positions, and noise between domains. | `car__spd(speed)__mic__noise_` |
| Valve | 03 | Open/close operation patterns vary between domains. | `pat(pattern)_` |
| Valve | 04 | Number and location of panels around the valve vary between domains. Open means that there are no panels. | `pat(pattern)__panel_` |
| Valve | 05 | Multiple valves are running simultaneously in the target domain. The anomaly condition means that a small piece of paper is caught in one or more of the valves. v1 and v2 refer to the valve index. | Source domain: `v1pat(valve1 pattern)_` or `v2pat(valve2 pattern)_`. Target domain: `v1pat__v2pat_` |

## External data resources

Based on the past DCASE's external data resource policy, we allow the use of external datasets and trained models under the following conditions:

1. Any test data in both development and evaluation datasets shall not be used for training.
2. Any data in ToyADMOS, ToyADMOS2, MIMII Dataset, MIMII DUE Dataset, the datasets of DCASE 2020 Challenge Task 2, and the dataset of DCASE 2021 Challenge Task 2 shall not be used.
3. Datasets, pre-trained models, and pre-trained parameters on the "List of external data resources allowed" can be used. The list will be updated upon request. Datasets, pre-trained models, and pre-trained parameters that were freely accessible to any other research group before 15th of April 2022 can be added to the list.
4. To add sources of external datasets, pre-trained models, or pre-trained parameters to the list, send a request to the organizers by the evaluation set publishing date. To give an equal opportunity to use them for all competitors, we will update the "list of external data resources allowed" on the web page accordingly.
5. Once the evaluation set is published, no further external sources will be added. The list will be locked after 1st of June 2022.

### List of external data resources allowed:

| Resource | Type | Added | Link |
|---|---|---|---|
| IDMT-ISA-ELECTRIC-ENGINE | audio | 15.03.2022 | https://www.idmt.fraunhofer.de/en/publications/isa-electric-engine.html |
| VGGish | model | 15.03.2022 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish |
| PANNs | model | 15.03.2022 | https://zenodo.org/record/3576403/ |
| PyTorch Image Models (including tens of pre-trained models) | model | 15.03.2022 | https://github.com/rwightman/pytorch-image-models |
| torchvision.models (including tens of pre-trained models) | model | 15.03.2022 | https://pytorch.org/vision/stable/models.html |

• Development dataset (6.2 GB), version 1.0
• Evaluation dataset (1.0 GB), version 1.0

Participants are required to submit both an anomaly score and a normal/anomaly decision result for each test clip. The anomaly score for each test clip will be used to calculate the area under the receiver operating characteristic (ROC) curve (AUC) and partial-AUC (pAUC) scores, which are used to calculate an official score and a final ranking. The normal/anomaly decision result for each test clip is used to calculate the precision, recall, and F1 scores, which will also be published when the challenge results are open. The method of evaluation is described in the Evaluation section.

The anomaly score takes a large value when the input signal seems to be anomalous, and vice versa. To calculate the anomaly score, participants need to train an anomaly score calculator $$\mathcal{A}$$ with parameter $$\theta$$. The input of $$\mathcal{A}$$ is a machine's operating sound $$x \in \mathbb{R}^L$$ and its machine information including machine type, section index, and other attribute information, and $$\mathcal{A}$$ outputs one anomaly score for the whole audio clip $$x$$ as $$\mathcal{A}_\theta (x) \in \mathbb{R}$$. Then, $$x$$ is determined to be anomalous when the anomaly score exceeds a pre-defined threshold value. Thus, $$\mathcal{A}$$ needs to be trained so that $$\mathcal{A}_\theta(x)$$ will be a large value both when the whole audio clip $$x$$ is anomalous and when a part of $$x$$ is anomalous, such as with collision anomalous sounds.

Figure 3 shows the overview of this task, where the example is a procedure for calculating the anomaly scores of the test clips of (fan, section 00, target domain). First, the participants train an anomaly score calculator $$\mathcal{A}$$ using training data both in the source and target domains and optional external data resources. Then, by using $$\mathcal{A}$$, participants calculate anomaly scores of all the test clips of (fan, section 00, target domain). By repeating this procedure, participants calculate the anomaly score of all the test clips of all the machine types, sections, and domains.

An arbitrary number of anomaly score calculators $$\mathcal{A}$$ can be used to calculate the anomaly scores of test clips. The simplest strategy is to use a single $$\mathcal{A}$$ to calculate the anomaly scores for a single section (e.g., section 00). In this case, $$\mathcal{A}$$ is specialized to a single section, so users of such a system are required to train $$\mathcal{A}$$ for each machine type, each product, and each condition. A more challenging strategy is to use a single $$\mathcal{A}$$ to calculate the anomaly scores of all the test clips of all the machine types and sections. The advantage of this strategy is that participants can use all the training clips provided; however, they need to consider the generalization of the model. Another typical scenario inspired by real-world applications is to train a general model only with the source-domain data. The task organizers do not impose this constraint but would appreciate participants’ efforts to impose constraints on themselves based on various real-world applications.

All training data with arbitrary splitting can be used to train an anomaly score calculator. For example, to train $$\mathcal{A}$$ to calculate the anomaly score of (valve, section 03, source domain), participants can opt to use training data only in (valve, section 03, source domain), training data in both the source domain and target domains, training data of all sections of valves, all provided training data, and/or other strategies. Of course, normal/anomalous clips in test data cannot be used for training; however, simulating anomalous samples using the listed external data resources is allowed.

Changing the model (model/architecture/hyperparameters) between machine types within a single submission is allowed. However, we expect participants to develop a simple ASD system (i.e., keep the model and hyperparameters fixed and only change the training data to adapt to each machine type).

# Submission

The official challenge submission consists of:

• System output for the evaluation data
• Meta information files

System output should be presented as one text file for each combination of machine type and section index. The file names should be:

• Anomaly score file: anomaly_score_<machine_type>_section_<section_index>.csv
• Detection result file: decision_result_<machine_type>_section_<section_index>.csv

The anomaly score file (in CSV format, without header row) contains the anomaly score for each audio file in the test data of the evaluation dataset. Result items can be in any order. All rows must be in the following format:

[filename (string)],[anomaly score (real value)]


Anomaly scores in the second column can take a negative value. For example, typical auto-encoder-based anomaly score calculators use the squared reconstruction error, which takes a non-negative value, while statistical model-based methods (such as GMM) use the negative log-likelihood as the anomaly score, which can take both positive and negative values.

The decision result file (in CSV format, without header row) contains the normal/anomaly decision result for each audio file in the test data of the evaluation dataset. Result items can be in any order. All rows must be in the following format:

[filename (string)],[decision result (0: normal, 1: anomaly)]
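The two output formats above can be produced with a few lines of code. The following is a minimal sketch, not an official submission tool: the helper `write_submission_files`, the example file names, the scores, the threshold, and the two-digit section string are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def write_submission_files(out_dir, machine_type, section_index, scores, threshold):
    """Write the anomaly score and decision result CSVs for one section.

    scores: dict mapping test file name -> anomaly score (float).
    threshold: decision threshold chosen without looking at the test data.
    """
    score_path = os.path.join(
        out_dir, f"anomaly_score_{machine_type}_section_{section_index}.csv")
    decision_path = os.path.join(
        out_dir, f"decision_result_{machine_type}_section_{section_index}.csv")
    with open(score_path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, score in scores.items():
            writer.writerow([name, score])  # no header row
    with open(decision_path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, score in scores.items():
            writer.writerow([name, int(score > threshold)])  # 1: anomaly, 0: normal
    return score_path, decision_path


# Hypothetical file names and scores, for illustration only.
example_scores = {"clip_a.wav": -1.25, "clip_b.wav": 3.5}
out_dir = tempfile.mkdtemp()
score_path, decision_path = write_submission_files(
    out_dir, "fan", "03", example_scores, threshold=0.0)
```

Negative scores are written unchanged, which matches the note that anomaly scores may take negative values.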


We allow up to four system output submissions per participant/team. For each system, meta information should be provided in a separate file that contains the task-specific information. All files should be packaged into a zip file for submission. Detailed information on the submission process can be found on the Submission page.

# Evaluation

## Metrics

This task is evaluated with the AUC and the pAUC. The pAUC is an AUC calculated from a portion of the ROC curve over the pre-specified range of interest.

Because this task focuses on domain generalization, the anomaly detector is expected to work with the same threshold regardless of the domain. Therefore, data from both domains in a section are used to calculate the AUC and pAUC. Also, in order to evaluate the detection performance for each domain, the AUC is calculated for each domain. The AUC for each machine type, section, and domain (source/target) and the pAUC for each machine type and section are defined as

$${\rm AUC}_{m, n, d} = \frac{1}{N^{-}_{d}N^{+}_{n}} \sum_{i=1}^{N^{-}_{d}} \sum_{j=1}^{N^{+}_{n}} \mathcal{H} (\mathcal{A}_{\theta} (x_{j}^{+}) - \mathcal{A}_{\theta} (x_{i}^{-})),$$
$${\rm pAUC}_{m, n} = \frac{1}{\lfloor p N^{-}_{n} \rfloor N^{+}_{n}} \sum_{i=1}^{\lfloor p N^{-}_{n} \rfloor} \sum_{j=1}^{N^{+}_{n}} \mathcal{H} (\mathcal{A}_{\theta} (x_{j}^{+}) - \mathcal{A}_{\theta} (x_{i}^{-}))$$

where $$m$$ represents the index of a machine type, $$n$$ represents the index of a section, $$d \in \{ {\rm source}, {\rm target} \}$$ represents a domain, $$\lfloor \cdot \rfloor$$ is the flooring function, and $$\mathcal{H} (x)$$ returns 1 when $$x > 0$$ and 0 otherwise. Here, $$\{x_{i}^{-}\}_{i=1}^{N^{-}_{d}}$$ are the normal test clips in the domain $$d$$ in the section $$n$$ of the machine type $$m$$, and $$\{x_{j}^{+}\}_{j=1}^{N^{+}_{n}}$$ are the anomalous test clips in the section $$n$$ of the machine type $$m$$; both sets are sorted so that their anomaly scores are in descending order. For the pAUC, the normal test clips of both domains in the section are used. $$N^{-}_{d}$$ is the number of normal test clips in the domain $$d$$ in the section $$n$$ of the machine type $$m$$, $$N^{-}_{n}$$ is the number of normal test clips in the section $$n$$ of the machine type $$m$$, and $$N^{+}_{n}$$ is the number of anomalous test clips in the section $$n$$ of the machine type $$m$$.
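The definitions above can be transcribed almost literally into code. The sketch below is a direct reading of the two formulas using NumPy, not the organizers' evaluation script; the function names and the toy scores are assumptions.

```python
import math

import numpy as np


def auc_from_definition(normal_scores, anomalous_scores):
    """Fraction of (normal, anomalous) pairs where the anomalous score is strictly larger."""
    normal = np.asarray(normal_scores, dtype=float)
    anomalous = np.asarray(anomalous_scores, dtype=float)
    # H(a_j - n_i) = 1 when a_j > n_i, evaluated for every pair at once.
    wins = anomalous[None, :] > normal[:, None]
    return float(wins.mean())


def pauc_from_definition(normal_scores, anomalous_scores, p=0.1):
    """AUC restricted to the floor(p * N^-) normal clips with the highest scores."""
    normal = np.sort(np.asarray(normal_scores, dtype=float))[::-1]  # descending
    k = math.floor(p * len(normal))
    if k == 0:
        raise ValueError("need at least 1/p normal clips to evaluate the pAUC")
    return auc_from_definition(normal[:k], anomalous_scores)


# Toy scores: ten normal clips and three anomalous clips.
normal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
anomalous = [9.5, 5.5, 12]
auc = auc_from_definition(normal, anomalous)    # 24 of 30 pairs -> 0.8
pauc = pauc_from_definition(normal, anomalous)  # only 12 beats the top normal score -> 1/3
```

For the per-domain AUC, `normal_scores` would hold only the normal clips of one domain, while `anomalous_scores` holds all anomalous clips of the section; for the pAUC, the normal clips of both domains are pooled.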

In our metric, the pAUC is calculated as the AUC over a low false-positive-rate (FPR) range $$[0, p]$$. The reason for the additional use of the pAUC is based on practical requirements. If an ASD system frequently gives false alarms, we cannot trust it. Therefore, it is especially important to increase the true-positive-rate under low FPR conditions. In this task, we will use $$p=0.1$$.

The official score $$\Omega$$ for each submitted system is given by the harmonic mean of the AUC and pAUC scores over all the machine types, sections, and domains as follows:

$$\Omega = h \left\{ {\rm AUC}_{m, n, d}, \ {\rm pAUC}_{m, n} \quad | \quad m \in \mathcal{M}, \ n \in \mathcal{S}(m), \ d \in \{ {\rm source}, {\rm target} \} \right\},$$

where $$h\left\{\cdot\right\}$$ represents the harmonic mean (over all machine types, sections, and domains), $$\mathcal{M}$$ represents the set of the machine types, and $$\mathcal{S}(m)$$ represents the set of the sections for the machine type $$m$$.
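As a small worked example of the aggregation, the sketch below pools per-domain AUCs and per-section pAUCs and takes their harmonic mean. The dictionary layout, the function name `official_score`, and the numbers are hypothetical.

```python
from statistics import harmonic_mean


def official_score(auc_by_key, pauc_by_key):
    """Harmonic mean over all per-domain AUCs and per-section pAUCs.

    auc_by_key: {(machine_type, section, domain): AUC}
    pauc_by_key: {(machine_type, section): pAUC}
    """
    values = list(auc_by_key.values()) + list(pauc_by_key.values())
    return harmonic_mean(values)


# Hypothetical scores for a single machine type with one section.
aucs = {("fan", "03", "source"): 0.8, ("fan", "03", "target"): 0.5}
paucs = {("fan", "03"): 0.5}
omega = official_score(aucs, paucs)  # 3 / (1/0.8 + 1/0.5 + 1/0.5)
```

Because the harmonic mean is dominated by its smallest terms, a system that performs poorly on a single domain or section is penalized more heavily than under an arithmetic mean.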

As the equations above show, a threshold value does not need to be determined to calculate the AUC, pAUC, or the official score, because the anomaly scores of the normal test clips themselves act as the threshold values. However, in real applications, the threshold value must be determined, and a decision must be made as to whether a clip is normal or anomalous. Therefore, participants are also required to submit the normal/anomaly decision results. The organizers will publish the AUC, pAUC, and official scores as well as the precision, recall, and F1-scores calculated for the normal/anomaly decision results.

Note: The submitted normal/anomaly decision results will not be used for the final ranking because the task organizers do not want to encourage participants to use a forbidden approach (i.e., threshold tuning based on the distribution in the evaluation dataset). Do not use other test clips to determine anomalies for each test clip.

## Ranking

The final ranking will be decided by sorting based on the official score $$\Omega$$.

# Baseline system

The task organizers provide two baseline systems that give reasonable performance on the dataset of Task 2. They are good starting points, especially for entry-level researchers who want to get familiar with the ASD task.

The autoencoder-based baseline is an example of an inlier modeling (IM)-based detector that models the distribution of inlier samples, and the MobileNetV2-based baseline is an example of an outlier exposure (OE)-based detector that utilizes outlier samples for modeling. In the previous year's task, common approaches were ensembles of IM-based and OE-based detectors and IM-based detection using features learned in a machine-identification task.

## Autoencoder-based baseline

This is an IM-based detector that uses an autoencoder (AE). The anomaly score is calculated as the reconstruction error of the observed sound. To obtain small anomaly scores for normal sounds, the AE is trained to minimize the reconstruction error of the normal training data. This method is based on the assumption that the AE cannot reconstruct sounds that are not used in training, that is, unknown anomalous sounds.

In the baseline system, we first calculate the log-mel-spectrogram of the input $$X = \{X_t\}_{t = 1}^T$$ where $$X_t \in \mathbb{R}^F$$, and $$F$$ and $$T$$ are the number of mel-filters and time-frames, respectively. Then, the acoustic feature at $$t$$ is obtained by concatenating consecutive frames of the log-mel-spectrogram as $$\psi_t = (X_t, \cdots, X_{t + P - 1}) \in \mathbb{R}^D$$, where $$D = P \times F$$, and $$P$$ is the number of frames of the context window. The anomaly score is calculated as:

$$A_{\theta}(X) = \frac{1}{DT} \sum_{t = 1}^T \| \psi_t - r_{\theta}(\psi_t) \|_{2}^{2},$$

where $$r_{\theta}$$ is the vector reconstructed by the autoencoder, and $$\| \cdot \|_2$$ is $$\ell_2$$ norm.
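The score above is simply the mean squared reconstruction error over all feature vectors and dimensions. A minimal NumPy sketch follows; `reconstruct` is a placeholder for the trained autoencoder $$r_{\theta}$$, and the zero-output stand-in used in the example is an assumption chosen only to make the arithmetic easy to check.

```python
import numpy as np


def ae_anomaly_score(features, reconstruct):
    """Mean squared reconstruction error over all feature vectors and dimensions.

    features: array of shape (T, D), one row per psi_t.
    reconstruct: callable mapping a (T, D) array to its (T, D) reconstruction;
                 it stands in for the trained autoencoder r_theta.
    """
    features = np.asarray(features, dtype=float)
    errors = features - reconstruct(features)
    # Sum of squared l2 norms divided by D * T equals the mean of all squared entries.
    return float(np.mean(errors ** 2))


# Toy stand-in for r_theta (an assumption): it reconstructs everything as zeros.
def zero_reconstruction(x):
    return np.zeros_like(x)


psi = np.array([[1.0, 2.0], [3.0, 4.0]])
score = ae_anomaly_score(psi, zero_reconstruction)  # (1 + 4 + 9 + 16) / 4 = 7.5
```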

To determine the anomaly detection threshold, we assume that $$A_{\theta}$$ follows a gamma distribution. The parameters of the gamma distribution are estimated from the histogram of $$A_{\theta}$$, and the anomaly detection threshold is determined as the 90th percentile of the gamma distribution. If $$A_{\theta}$$ for each test clip is greater than this threshold, the clip is judged to be abnormal; if it is smaller, it is judged to be normal.
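A hedged sketch of this thresholding step is shown below. It uses a maximum-likelihood fit via `scipy.stats.gamma.fit` on raw scores, which stands in for the histogram-based estimate mentioned above, and it assumes the reference scores come from normal training clips rather than test clips. The synthetic scores and the helper name `gamma_threshold` are illustrative only.

```python
import numpy as np
from scipy import stats


def gamma_threshold(reference_scores, percentile=0.9):
    """Fit a gamma distribution to anomaly scores and return its percentile point."""
    shape, loc, scale = stats.gamma.fit(reference_scores)
    return float(stats.gamma.ppf(percentile, shape, loc=loc, scale=scale))


# Synthetic stand-in for anomaly scores of normal training clips (an assumption).
rng = np.random.default_rng(0)
reference_scores = rng.gamma(shape=2.0, scale=1.5, size=2000)

threshold = gamma_threshold(reference_scores)
# 1: judged anomalous, 0: judged normal.
decisions = (reference_scores > threshold).astype(int)
```

With a 90th-percentile threshold, roughly one in ten clips drawn from the fitted distribution is flagged, so the percentile directly sets the expected false-alarm rate on normal data.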

### Parameters

#### Acoustic features

• The frame size for STFT is 64 ms (50 % hop size)
• Log-mel energies for 128 ($$= F$$) bands
• 5 ($$= P$$) consecutive frames are concatenated.
• 640 ($$= D = P \times F$$) dimensions are input to the autoencoder.
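The frame concatenation listed above can be sketched with NumPy as follows. The random array is a stand-in for a real log-mel spectrogram, and the helper name `concat_frames` is hypothetical.

```python
import numpy as np


def concat_frames(log_mel, n_frames=5):
    """Stack n_frames consecutive frames into one vector per starting frame.

    log_mel: array of shape (T, F), rows are time frames.
    Returns an array of shape (T - n_frames + 1, n_frames * F).
    """
    log_mel = np.asarray(log_mel)
    n_total, n_mels = log_mel.shape
    n_vectors = n_total - n_frames + 1
    if n_vectors < 1:
        raise ValueError("clip is shorter than the context window")
    # Column block p holds frame t + p for every starting frame t.
    return np.concatenate(
        [log_mel[p:p + n_vectors] for p in range(n_frames)], axis=1)


# Random stand-in for a log-mel spectrogram with T = 20 frames and F = 128 bands.
log_mel = np.random.default_rng(0).normal(size=(20, 128))
features = concat_frames(log_mel, n_frames=5)
print(features.shape)  # (16, 640)
```

Only T - P + 1 complete windows fit inside a clip of T frames, so this sketch returns that many 640-dimensional vectors.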

#### Network Architecture

• Input shape: 640
• Architecture:
  • Dense layer #1
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #2
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #3
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #4
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Bottleneck layer
    • Dense layer (units: 8)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #5
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #6
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #7
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #8
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Output layer
    • Dense layer (units: 640)
• Learning (epochs: 100, batch size: 512, data shuffling between epochs)
• Optimizer: Adam (learning rate: 0.001)
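To give a feel for the size of this network, the following sketch counts its parameters from the layer list above. It assumes every Dense layer has a bias term and that each Batch Normalization layer stores four values per unit (scale, shift, running mean, running variance), which is the usual Keras convention; the function name is hypothetical.

```python
def count_parameters(input_dim=640, hidden_units=128, bottleneck_units=8, n_hidden=4):
    """Count Dense and Batch Normalization parameters of the listed autoencoder."""
    # Output sizes of every Dense layer, in order: 4 encoder layers, the
    # bottleneck, 4 decoder layers, and the output layer.
    layer_units = ([hidden_units] * n_hidden + [bottleneck_units]
                   + [hidden_units] * n_hidden + [input_dim])
    dense = 0
    batch_norm = 0
    fan_in = input_dim
    for index, units in enumerate(layer_units):
        dense += fan_in * units + units  # weights plus biases
        is_output = index == len(layer_units) - 1
        if not is_output:
            # Assumption: 4 values per unit (scale, shift, running mean, running variance).
            batch_norm += 4 * units
        fan_in = units
    return dense, batch_norm


dense, batch_norm = count_parameters()
print(dense, batch_norm, dense + batch_norm)  # 265864 4128 269992
```

Under these assumptions the model has roughly 270 thousand parameters, most of them in the first and last Dense layers that connect to the 640-dimensional input and output.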

## MobileNetV2-based baseline

This is an OE-based detector that uses MobileNetV2. This baseline identifies from which section the observed signal was generated. In other words, it outputs the softmax value that is the predicted probability for each section. The anomaly score is calculated as the averaged negative logit of the predicted probabilities for the correct section.

We first calculate the log-mel-spectrogram of the input $$X = \{X_t\}_{t = 1}^T$$ where $$X_t \in \mathbb{R}^F$$, and $$F$$ and $$T$$ are the number of mel-filters and time-frames, respectively. Then, the acoustic feature (two-dimensional image) at $$t$$ is obtained by concatenating consecutive frames of the log-mel-spectrogram as $$\psi_t = (X_t, \cdots, X_{t + P - 1}) \in \mathbb{R}^{P \times F}$$, where $$P$$ is the number of frames of the context window. By shifting the context window by $$L$$ frames, $$B (= \lfloor \frac{T - P}{L} \rfloor)$$ images are extracted. The anomaly score is calculated as:

$$A_{\theta}(X) = \frac{1}{B} \sum_{b = 1}^B \log \left\{ \frac{1 - p_{\theta}(\psi_{t(b)})}{p_{\theta}(\psi_{t(b)})} \right\},$$

where $$t(b)$$ is the beginning frame index of the $$b$$-th image, and $$p_{\theta}$$ is the softmax output by MobileNetV2 for the correct section.
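
As a rough illustration of the score above, the sketch below averages $$\log\{(1 - p) / p\}$$ over the images of one clip. The input list stands in for the softmax outputs $$p_{\theta}(\psi_{t(b)})$$ of the correct section; the function name and the clipping constant `eps` are assumptions added here to keep the logarithm finite, not part of the task description.

```python
import math


def anomaly_score(correct_section_probs, eps=1e-7):
    """Average negative logit of the correct-section probabilities of one clip."""
    total = 0.0
    for p in correct_section_probs:
        # Assumption: clip p away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += math.log((1.0 - p) / p)
    return total / len(correct_section_probs)


# Hypothetical probabilities for two clips (not real model outputs).
confident_clip = [0.95, 0.90, 0.97]
uncertain_clip = [0.30, 0.45, 0.20]
print(anomaly_score(confident_clip), anomaly_score(uncertain_clip))
```

A clip whose images are confidently assigned to the correct section receives a negative score, while a clip the classifier struggles with receives a positive one.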

To determine the anomaly detection threshold, we assume that $$A_{\theta}$$ follows a gamma distribution. The parameters of the gamma distribution are estimated from the histogram of $$A_{\theta}$$, and the anomaly detection threshold is determined as the 90th percentile of the gamma distribution. If $$A_{\theta}$$ for each test clip is greater than this threshold, the clip is judged to be abnormal; if it is smaller, it is judged to be normal.
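
The thresholding step can be sketched with SciPy as follows. This is only one plausible reading of the description: it fits the gamma distribution directly to a set of scores by maximum likelihood (rather than to a binned histogram) and assumes those scores come from normal clips. The synthetic scores and the function names are illustrative.

```python
import numpy as np
from scipy import stats


def gamma_threshold(scores, percentile=0.9):
    """Fit a gamma distribution to the scores and return the requested percentile."""
    shape, loc, scale = stats.gamma.fit(scores)
    return stats.gamma.ppf(percentile, shape, loc=loc, scale=scale)


def decide(scores, threshold):
    """Return 1 (anomalous) where the score exceeds the threshold, else 0 (normal)."""
    return [1 if score > threshold else 0 for score in scores]


# Synthetic stand-in for anomaly scores (assumption: not real challenge data).
rng = np.random.default_rng(0)
example_scores = rng.gamma(shape=2.0, scale=1.5, size=2000)
threshold = gamma_threshold(example_scores)
print("threshold:", threshold)
print("decisions:", decide([threshold - 1.0, threshold + 1.0], threshold))
```

With a 90th-percentile threshold, roughly one in ten clips drawn from the fitted distribution would be flagged.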

### Parameters

#### Acoustic features

• The frame size for STFT is 64 ms (50 % hop size)
• Log-mel energies for 128 ($$= F$$) bands
• 64 ($$= P$$) consecutive frames are concatenated.
• Images of size $$P \times F$$ are input to a network using MobileNetV2.
• The context window is shifted by 8 ($$= L$$) frames (called "hop frames" or "stride").
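
To make these settings concrete, the sketch below derives the time span of one image and the window start indices from the parameters listed above. The clip length of 313 frames is a hypothetical value chosen for illustration, and starting the first window at frame 0 is an assumption.

```python
FRAME_MS = 64            # STFT frame size
HOP_MS = FRAME_MS // 2   # 50 % hop size


def image_span_ms(context=64):
    """Duration covered by one image of `context` consecutive frames."""
    return (context - 1) * HOP_MS + FRAME_MS


def window_starts(num_frames, context=64, hop=8):
    """Start indices of the B = floor((T - P) / L) context windows."""
    count = max((num_frames - context) // hop, 0)
    return [b * hop for b in range(count)]


starts = window_starts(313)  # hypothetical clip with T = 313 frames
print("image span (ms):", image_span_ms())
print("number of images:", len(starts))
```

Because the window shift (8 frames) is much smaller than the window length (64 frames), neighboring images overlap heavily.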

#### Network Architecture

• Input shape: $$64 \times 128$$ image
• Architecture:
  • Triplication layer
    • Triplication of the input image to each color channel
  • MobileNetV2
    • Input: $$64 \times 128 \times 3$$ image
    • Output: Softmax for 3 sections (Section 00, 01, and 02)
• Learning (epochs: 20, batch size: 32, data shuffling between epochs)
• Optimizer: Adam (learning rate: 0.00001)

## Repository

Detailed information can be found in the GitHub repository. The directory structure is briefly described here as a reference for label information. When you unzip the files downloaded from the GitHub repository and Zenodo, you can see the following directory structure. As described in the Dataset section, the machine type information is given by directory name, and the section index, domain, and the condition information are given by file name, as:

• /00_train.py
• /01_test.py
• /common.py
• /keras_model.py
• /baseline.yaml
• /00_train.py
• /01_test.py
• /common.py
• /keras_model.py
• /baseline.yaml
• /dev_data
  • /fan
    • /train (only normal clips)
      • /section_00_source_train_normal_0000_.wav
      • ...
      • /section_00_source_train_normal_0989_.wav
      • /section_00_target_train_normal_0000_.wav
      • ...
      • /section_00_target_train_normal_0009_.wav
      • /section_01_source_train_normal_0000_.wav
      • ...
      • /section_02_target_train_normal_0009_.wav
    • /test
      • /section_00_source_test_normal_0000_.wav
      • ...
      • /section_00_source_test_normal_0049_.wav
      • /section_00_source_test_anomaly_0000_.wav
      • ...
      • /section_00_source_test_anomaly_0049_.wav
      • /section_00_target_test_normal_0000_.wav
      • ...
      • /section_00_target_test_normal_0049_.wav
      • /section_00_target_test_anomaly_0000_.wav
      • ...
      • /section_00_target_test_anomaly_0049_.wav
      • /section_01_source_test_normal_0000_.wav
      • ...
      • /section_02_target_test_anomaly_0049_.wav
    • attributes_00.csv (attribute csv for section 00)
    • attributes_01.csv (attribute csv for section 01)
    • attributes_02.csv (attribute csv for section 02)
  • /gearbox (The other machine types have the same directory structure as fan.)
  • /bearing
  • /slider (slider means "slide rail")
  • /ToyCar
  • /ToyTrain
  • /valve
• /eval_data
  • /fan
    • /train (after launch of the additional training dataset)
      • /section_03_source_train_normal_0000_.wav
      • ...
      • /section_03_source_train_normal_0989_.wav
      • /section_03_target_train_normal_0000_.wav
      • ...
      • /section_03_target_train_normal_0009_.wav
      • /section_04_source_train_normal_0000_.wav
      • ...
      • /section_05_target_train_normal_0009_.wav
    • /test (after launch of the evaluation dataset)
      • /section_03_test_0000.wav
      • ...
      • /section_03_test_0199.wav
      • /section_04_test_0000.wav
      • ...
      • /section_05_test_0199.wav
    • attributes_03.csv (attribute csv for train data in section 03)
    • attributes_04.csv (attribute csv for train data in section 04)
    • attributes_05.csv (attribute csv for train data in section 05)
  • /gearbox (The other machine types have the same directory structure as fan.)
  • /bearing
  • /slider (slider means "slide rail")
  • /ToyCar
  • /ToyTrain
  • /valve
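
Since the section index, domain, and condition are encoded in the file name, they can be recovered with a regular expression. The sketch below is based only on the name patterns shown in the listing; the function name, the dictionary keys, and the handling of any text between the final underscore and `.wav` are assumptions.

```python
import re

# Pattern for labelled clips such as "section_00_source_train_normal_0000_.wav".
# Assumption: anything between the last underscore and ".wav" is kept as a
# free-form suffix, since the listing shows nothing in that position.
CLIP_NAME = re.compile(
    r"^section_(?P<section>\d{2})_(?P<domain>source|target)_"
    r"(?P<split>train|test)_(?P<condition>normal|anomaly)_"
    r"(?P<index>\d{4})_(?P<suffix>.*)\.wav$"
)


def parse_clip_name(name):
    """Return the labelled fields of a clip name, or None if it carries no labels."""
    match = CLIP_NAME.match(name.lstrip("/"))
    return match.groupdict() if match else None


print(parse_clip_name("/section_00_target_test_anomaly_0049_.wav"))
print(parse_clip_name("/section_03_test_0000.wav"))
```

Evaluation test clips such as `section_03_test_0000.wav` carry no domain or condition in their names, so the parser returns `None` for them.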

After you run the training script 00_train.py and the test script 01_test.py, a csv file for each section index and domain that lists the anomaly scores for each clip will be stored in the directory result/. Also, a csv file for each section index and domain that lists the normal/anomaly decision results for each clip will be stored in the same directory. You can get more detailed information in the GitHub repository.

## Results with the development dataset

We evaluated the AUC and pAUC on the development dataset using several types of GPUs (RTX 2080, etc.). Because results produced with a GPU are generally non-deterministic, the averages and standard deviations over five independent trials (training and testing) are shown in the following tables.

### Detailed results for AE-based baseline

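| ToyCar | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 86.42 % | 89.85 % | 98.84 % | 91.70 % | 90.41 % |
| AUC source (Std.) | 1.10 % | 1.39 % | 0.52 % | 0.75 % | 0.78 % |
| AUC target (Ave.) | 41.48 % | 41.93 % | 26.50 % | 36.64 % | 34.81 % |
| AUC target (Std.) | 6.11 % | 5.36 % | 13.52 % | 3.41 % | 4.99 % |
| pAUC (Ave.) | 51.31 % | 54.08 % | 52.96 % | 52.79 % | 52.74 % |
| pAUC (Std.) | 1.34 % | 1.84 % | 3.15 % | 1.04 % | 1.01 % |

| ToyTrain | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 67.54 % | 79.32 % | 84.08 % | 76.98 % | 76.32 % |
| AUC source (Std.) | 0.97 % | 0.82 % | 0.38 % | 0.21 % | 0.18 % |
| AUC target (Ave.) | 33.68 % | 29.87 % | 15.52 % | 26.36 % | 23.35 % |
| AUC target (Std.) | 3.12 % | 5.62 % | 14.90 % | 1.32 % | 5.87 % |
| pAUC (Ave.) | 52.72 % | 50.64 % | 48.33 % | 50.56 % | 50.48 % |
| pAUC (Std.) | 1.63 % | 2.33 % | 2.33 % | 0.78 % | 0.78 % |

| bearing | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 57.48 % | 71.03 % | 42.34 % | 56.95 % | 54.42 % |
| AUC source (Std.) | 3.66 % | 3.02 % | 3.79 % | 2.48 % | 2.36 % |
| AUC target (Ave.) | 63.07 % | 61.04 % | 52.91 % | 59.01 % | 58.38 % |
| AUC target (Std.) | 3.01 % | 16.97 % | 2.61 % | 7.26 % | 7.58 % |
| pAUC (Ave.) | 51.49 % | 55.85 % | 49.18 % | 52.18 % | 51.98 % |
| pAUC (Std.) | 1.07 % | 8.4 % | 2.03 % | 3.68 % | 3.43 % |

| fan | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 84.69 % | 71.69 % | 80.54 % | 78.97 % | 78.59 % |
| AUC source (Std.) | 1.74 % | 0.69 % | 1.42 % | 1.1 % | 1.03 % |
| AUC target (Ave.) | 39.35 % | 44.74 % | 63.49 % | 49.19 % | 47.18 % |
| AUC target (Std.) | 9.35 % | 1.79 % | 2.36 % | 3.71 % | 4.59 % |
| pAUC (Ave.) | 59.95 % | 51.12 % | 62.88 % | 57.98 % | 57.52 % |
| pAUC (Std.) | 2.00 % | 0.55 % | 1.55 % | 0.67 % | 0.64 % |

| gearbox | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 64.63 % | 67.66 % | 75.38 % | 69.22 % | 68.93 % |
| AUC source (Std.) | 0.88 % | 0.51 % | 0.75 % | 0.27 % | 0.26 % |
| AUC target (Ave.) | 64.79 % | 58.12 % | 65.57 % | 62.83 % | 62.64 % |
| AUC target (Std.) | 1.06 % | 0.38 % | 0.82 % | 0.49 % | 0.45 % |
| pAUC (Ave.) | 60.93 % | 53.74 % | 61.51 % | 58.72 % | 58.49 % |
| pAUC (Std.) | 2.31 % | 0.56 % | 0.69 % | 0.76 % | 0.71 % |

| slider | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 81.92 % | 67.85 % | 86.66 % | 78.81 % | 77.95 % |
| AUC source (Std.) | 0.81 % | 0.53 % | 0.39 % | 0.36 % | 0.35 % |
| AUC target (Ave.) | 58.04 % | 50.3 % | 38.78 % | 49.04 % | 47.67 % |
| AUC target (Std.) | 1.22 % | 1.25 % | 5.13 % | 1.28 % | 1.91 % |
| pAUC (Ave.) | 61.65 % | 53.06 % | 53.44 % | 56.05 % | 55.78 % |
| pAUC (Std.) | 1.22 % | 0.53 % | 1.18 % | 0.57 % | 0.55 % |

| valve | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 54.24 % | 50.45 % | 51.56 % | 52.09 % | 52.01 % |
| AUC source (Std.) | 0.68 % | 3.67 % | 2.89 % | 0.18 % | 0.23 % |
| AUC target (Ave.) | 52.73 % | 53.01 % | 43.84 % | 49.86 % | 49.46 % |
| AUC target (Std.) | 1.93 % | 1.73 % | 1.11 % | 0.64 % | 0.68 % |
| pAUC (Ave.) | 52.15 % | 49.78 % | 49.24 % | 50.39 % | 50.36 % |
| pAUC (Std.) | 0.25 % | 0.19 % | 0.65 % | 0.24 % | 0.24 % |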
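
The two summary columns aggregate the three section values with an arithmetic and a harmonic mean. The sketch below shows both on the trial-averaged ToyCar source AUC values of the AE-based baseline. The tabulated means may have been aggregated per trial before averaging (an assumption), so recomputing them from the averaged section values need not reproduce the table exactly.

```python
from statistics import fmean, harmonic_mean


def summarize(section_values):
    """Return (arithmetic mean, harmonic mean) of per-section scores."""
    return fmean(section_values), harmonic_mean(section_values)


# Trial-averaged ToyCar source AUC values (in %) of the AE-based baseline.
section_auc = [86.42, 89.85, 98.84]
arithmetic, harmonic = summarize(section_auc)
print(f"arithmetic mean: {arithmetic:.2f} %")
print(f"harmonic mean:   {harmonic:.2f} %")
```

The harmonic mean never exceeds the arithmetic mean and is pulled down more strongly by a single poorly performing section, which makes it the stricter summary.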

### Detailed results for MobileNetV2-based baseline

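| ToyCar | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 47.40 % | 62.02 % | 74.19 % | 61.21 % | 59.12 % |
| AUC source (Std.) | 7.22 % | 11.07 % | 7.94 % | 8.28 % | 8.20 % |
| AUC target (Ave.) | 56.40 % | 56.38 % | 45.64 % | 52.81 % | 51.96 % |
| AUC target (Std.) | 4.11 % | 11.31 % | 11.32 % | 4.67 % | 5.19 % |
| pAUC (Ave.) | 49.96 % | 50.92 % | 56.51 % | 52.46 % | 52.27 % |
| pAUC (Std.) | 2.56 % | 2.52 % | 6.07 % | 3.03 % | 2.90 % |

| ToyTrain | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 46.02 % | 71.96 % | 63.23 % | 60.40 % | 57.26 % |
| AUC source (Std.) | 12.21 % | 5.72 % | 25.60 % | 8.36 % | 9.22 % |
| AUC target (Ave.) | 49.41 % | 45.14 % | 44.34 % | 46.30 % | 45.90 % |
| AUC target (Std.) | 15.14 % | 13.66 % | 21.50 % | 14.49 % | 14.33 % |
| pAUC (Ave.) | 50.25 % | 52.97 % | 51.54 % | 51.59 % | 51.52 % |
| pAUC (Std.) | 1.49 % | 4.61 % | 4.34 % | 1.90 % | 1.92 % |

| bearing | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 67.85 % | 59.67 % | 61.71 % | 63.07 % | 60.58 % |
| AUC source (Std.) | 19.61 % | 12.67 % | 33.52 % | 11.31 % | 10.87 % |
| AUC target (Ave.) | 60.17 % | 64.65 % | 60.55 % | 61.79 % | 59.94 % |
| AUC target (Std.) | 7.24 % | 12.63 % | 35.10 % | 13.57 % | 17.91 % |
| pAUC (Ave.) | 54.41 % | 55.09 % | 64.18 % | 57.89 % | 57.14 % |
| pAUC (Std.) | 5.72 % | 3.36 % | 19.79 % | 7.39 % | 5.96 % |

| fan | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 71.07 % | 76.26 % | 67.29 % | 71.54 % | 70.75 % |
| AUC source (Std.) | 19.84 % | 4.95 % | 10.34 % | 7.80 % | 8.70 % |
| AUC target (Ave.) | 62.13 % | 35.12 % | 58.02 % | 51.76 % | 48.22 % |
| AUC target (Std.) | 12.50 % | 13.38 % | 7.46 % | 6.79 % | 8.15 % |
| pAUC (Ave.) | 55.40 % | 52.14 % | 65.14 % | 57.56 % | 56.9 % |
| pAUC (Std.) | 11.29 % | 4.08 % | 1.09 % | 3.54 % | 3.32 % |

| gearbox | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 63.54 % | 66.68 % | 80.87 % | 70.37 % | 69.21 % |
| AUC source (Std.) | 9.46 % | 12.29 % | 7.85 % | 5.56 % | 5.97 % |
| AUC target (Ave.) | 67.02 % | 66.96 % | 43.15 % | 59.04 % | 56.19 % |
| AUC target (Std.) | 13.50 % | 8.92 % | 16.12 % | 9.02 % | 10.14 % |
| pAUC (Ave.) | 62.12 % | 56.85 % | 50.62 % | 56.53 % | 56.03 % |
| pAUC (Std.) | 11.66 % | 4.47 % | 7.73 % | 7.12 % | 7.02 % |

| slider | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 87.15 % | 49.66 % | 72.70 % | 69.84 % | 65.15 % |
| AUC source (Std.) | 2.71 % | 30.46 % | 11.67 % | 10.89 % | 16.30 % |
| AUC target (Ave.) | 80.77 % | 32.07 % | 32.94 % | 48.59 % | 38.23 % |
| AUC target (Std.) | 4.53 % | 46.84 % | 19.77 % | 7.97 % | 11.39 % |
| pAUC (Ave.) | 71.57 % | 48.21 % | 49.69 % | 56.49 % | 54.67 % |
| pAUC (Std.) | 5.28 % | 2.73 % | 1.63 % | 2.26 % | 1.49 % |

| valve | Section 0 | Section 1 | Section 2 | Arithmetic mean | Harmonic mean |
| --- | --- | --- | --- | --- | --- |
| AUC source (Ave.) | 75.26 % | 54.78 % | 76.26 % | 68.77 % | 67.09 % |
| AUC source (Std.) | 4.84 % | 5.37 % | 1.02 % | 1.25 % | 1.52 % |
| AUC target (Ave.) | 43.60 % | 60.43 % | 78.74 % | 60.92 % | 57.22 % |
| AUC target (Std.) | 14.38 % | 5.08 % | 2.64 % | 4.13 % | 6.63 % |
| pAUC (Ave.) | 55.37 % | 54.69 % | 85.74 % | 65.27 % | 62.42 % |
| pAUC (Std.) | 5.86 % | 3.87 % | 0.08 % | 1.44 % | 1.86 % |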

# Citation

If you are participating in this task or using the MIMII DG dataset, the ToyADMOS2 dataset, or the baseline code, please cite the following three papers:

Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi. MIMII DG: sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task. In arXiv e-prints: 2205.13879, 2022.

Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), 1–5. Barcelona, Spain, November 2021.

#### ToyADMOS2: Another Dataset of Miniature-Machine Operating Sounds for Anomalous Sound Detection under Domain Shift Conditions

##### Abstract

This paper proposes a new large-scale dataset called “ToyADMOS2” for anomaly detection in machine operating sounds (ADMOS). As with our previous ToyADMOS dataset, we collected a large number of operating sounds of miniature machines (toys) under normal and anomaly conditions by deliberately damaging them, but extended them in this case by providing a controlled depth of damages in the anomaly samples. Since typical application scenarios of ADMOS require robust performance under domain-shift conditions, the ToyADMOS2 dataset is designed for evaluating systems under such conditions. The released dataset consists of two sub-datasets for machine-condition inspection: fault diagnosis of machines with geometrically fixed tasks and fault diagnosis of machines with moving tasks. Domain shifts are represented by introducing several differences in operating conditions, such as the use of the same machine type but with different models and parts configurations, operating speeds, microphone arrangements, etc. Each sub-dataset contains over 27 k samples of normal machine-operating sounds and over 8 k samples of anomalous sounds recorded with five to eight microphones. The dataset is freely available for download at https://github.com/nttcslab/ToyADMOS2-dataset and https://doi.org/10.5281/zenodo.4580270.

Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Masaaki Yamamoto, and Yohei Kawaguchi. Description and discussion on DCASE 2022 challenge task 2: unsupervised anomalous sound detection for machine condition monitoring applying domain generalization techniques. In arXiv e-prints: 2206.05876, 2022.