# Unsupervised Anomalous Sound Detection for Machine Condition Monitoring under Domain Shifted Conditions

### Coordinators

• Yohei Kawaguchi (Hitachi, Ltd.)
• Keisuke Imoto (Doshisha University)
• Yuma Koizumi (Google, Inc.)
• Noboru Harada
• Daisuke Niizumi
• Kota Dohi (Hitachi, Ltd.)
• Ryo Tanabe (Hitachi, Ltd.)
• Harsh Purohit (Hitachi, Ltd.)
• Takashi Endo (Hitachi, Ltd.)

The challenge has ended. Full results for this task can be found on the results page.

# Description

Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, including artificial intelligence (AI)-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for machine condition monitoring. Figure 1 shows a simplified task description.

This task is the follow-up to DCASE 2020 Task 2. The 2021 version has two main challenges:

1. The task is detecting unknown anomalous sounds under the condition that only normal sound clips have been provided as training data, as in 2020 (i.e., unsupervised learning scenario). In real-world factories, actual anomalous sounds rarely occur and are highly diverse. Therefore, exhaustive patterns of anomalous sounds are impossible to deliberately make and/or collect. This means we have to detect unknown anomalous sounds that were not observed in the given training data.

2. The task is performed under the conditions that the acoustic characteristics of the training data and the test data are different (i.e., domain shift). The scope includes differences in operating speed, machine load, environmental noise, etc. Compared to 2020, the advancement of this task is that only a few normal clips are provided for the test domain, so this additional setup simulates a more realistic situation.

The first challenge is the same as in DCASE 2020 Task 2, while the second is newly added this year.

## Why focus on domain shift?

The purpose of this task is to solve the problem of normal sounds being incorrectly judged as anomalous due to changes within the normal conditions (i.e., domain shift). The task setup of the 2020 version was the ASD under ideal conditions. That is, the training- and testing-phase datasets were generated under the same recording conditions, and enough normal training clips recorded under the test domain were made available. In contrast, real-world cases are more complicated and often involve different machine operating conditions between the training and testing phases. A frequent example of this is when the motor speed continuously varies in a conveyor transporting products on a production line based on the production volume in response to product demand. Since there is infinite variation in rotation speed, the sound will also change with infinite variation. Due to the seasonal demand for many products, a limited period of recording training data limits the motor speed during that period (e.g., 200-300 rpm for autumn) and variations in the training data. However, in the test phase, the ASD system must continue to monitor the conveyor through all seasons, so it must be able to monitor all possible motor speed conditions, including those that differ from the training data (such as 100-400 rpm). In addition to the conditions of the machine, environmental noise conditions (SNR, sound characteristics, etc.) also fluctuate uncontrollably depending on the seasonal demand. In such a situation, the normal state's distribution will be changed (i.e., domain shift).

## Schedule

Based on the DCASE 2021 Challenge schedule, the important task days will be as follows:

• Task open: 1st of March 2021
• Additional training dataset release: 1st of April 2021
• Evaluation dataset release: 1st of June 2021
• External resource list lock: 1st of June 2021
• Challenge deadline: 15th of June 2021
• Challenge results: 1st of July 2021

External resources on the "List of external datasets and models allowed" can be used (cf. external data resource section). The list of external datasets and models allowed will be updated upon request. Any external resources that are freely accessed before the 1st of April 2021 can be added. Please send a request email to the task organizers. The list will be locked after the release date of the evaluation dataset (1st of June 2021). To avoid developing new external resources using machine information in the evaluation dataset, we will release the additional training dataset after the 1st of April 2021. Note that the additional training dataset contains the matching training data of machines used in the evaluation dataset (cf. dataset section).

# Audio datasets

## Dataset overview

The data used for this task consists of the normal/anomalous operating sounds of seven types of machines. Each recording is single-channel, 10-second audio that includes both the sounds of a machine and its associated equipment as well as environmental sounds. The following seven types of machines are used in this task:

• Fan
• Gearbox
• Pump
• Slide rail (also called slider)
• ToyCar
• ToyTrain
• Valve

Figure 2 shows an overview of the datasets: the development, additional training, and evaluation datasets. Every dataset consists of the seven types of machines, and each type of machine consists of three "sections" (Section 00, 01, and 02 for the development dataset and Section 03, 04, and 05 for the additional training dataset and the evaluation dataset).

## Definition

First, we define some important terms in this task: "machine type," "section," "source domain," and "target domain."

• The machine type means the kind of machine, which can be one of seven in this task: fan, gearbox, pump, slide rail, ToyCar, ToyTrain, and valve.
• The section is defined as a subset of the data within one machine type and consists of data from two domains, called the source domain and the target domain, which are described next. The section is a unit for calculating performance metrics and is almost identical to what was called "machine ID" in the 2020 version. In the 2020 version, there was a one-to-one correspondence between machine IDs and products, but in the 2021 version, the same product may appear in different sections, and different products may appear in the same section.
• The source domain is the original condition, for which a sufficient number of training clips is available, and the target domain is the shifted condition, for which only a few audio clips are provided as training data. The conditions of the source and target domains differ in terms of operating speed, machine load, viscosity, heating temperature, environmental noise, SNR, etc.

## Development and evaluation datasets

Our whole dataset consists of three datasets:

• Development dataset: This dataset consists of three sections for each machine type (Section 00, 01, and 02), and each section is a complete set of training and test data. For each section, this dataset provides (i) around 1,000 clips of normal sounds in a source domain for training, (ii) only three clips of normal sounds in a target domain for training, (iii) around 100 clips each of normal and anomalous sounds in the source domain for the test, and (iv) around 100 clips each of normal and anomalous sounds in the target domain for the test.
• Additional training dataset: This dataset provides the other three sections for each machine type (Section 03, 04, and 05). Each section consists of (i) around 1,000 clips of normal sounds in a source domain for training and (ii) only three clips of normal sounds in a target domain for training. Participants may also use this dataset for training. The additional training dataset will be open on the 1st of April.
• Evaluation dataset: This dataset provides test clips for the three sections identical to the additional training dataset (Section 03, 04, and 05). Each section consists of (i) test clips in the source domain and (ii) test clips in the target domain, none of which have a condition label (i.e., normal or anomaly). Note that the sections of the evaluation dataset (Section 03, 04, and 05) are the same as the additional training dataset but different from the development dataset (Section 00, 01, and 02).

As described above, the modification from the 2020 version is that the training dataset is extremely unbalanced. Most of the training data (1000 clips for each section) are from the source domain, and only a few normal clips (three clips for each section) are provided for the target domain, even though the number of clips in the test data set is the same in both domains. Thus, the participants have to develop a system adapted to the target domain using a few clips while avoiding performance degradation for the source domain.

## Recording procedure

Normal/anomalous operating sounds of machines and related equipment were recorded. Anomalous sounds were collected by deliberately damaging machines. To simplify the task, we only used the first channel of the multi-channel recordings; all recordings were regarded as single-channel recordings from a fixed microphone. We mixed a machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise clips were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

## Reference labels

The given labels for each training/test clip are machine type, section index, normal/anomaly information, and brief attribute information about conditions other than normal/abnormal. The machine type information is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information is given by their respective file names. For the training data, the attribute information is given by their respective file names.
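The labels described above can be recovered from a clip's path with a small parser. The following is a minimal sketch, not part of the official baseline: the function name `parse_clip_path` is hypothetical, and the file-name layout (including the assumption that the attribute string follows the four-digit clip index) is inferred from the example file names in the Repository section.

```python
import re
from pathlib import PurePosixPath

# Assumed layout: section_<NN>_<domain>_<split>[_<label>]_<index>[_<attribute>].wav
# The label is absent in the evaluation dataset; the attribute position is an assumption.
_CLIP_PATTERN = re.compile(
    r"^section_(?P<section>\d{2})_(?P<domain>source|target)_(?P<split>train|test)"
    r"(?:_(?P<label>normal|anomaly))?_(?P<index>\d{4})(?:_(?P<attribute>.*))?\.wav$"
)


def parse_clip_path(path):
    """Return the machine type, section, domain, split, label, index, and attribute of a clip."""
    p = PurePosixPath(path)
    match = _CLIP_PATTERN.match(p.name)
    if match is None:
        raise ValueError(f"unexpected file name: {p.name}")
    info = match.groupdict()
    # Assumed directory layout: <machine_type>/<train|source_test|target_test>/<file>
    info["machine_type"] = p.parent.parent.name
    return info
```

For an evaluation clip, `label` and `attribute` come back as `None`, which reflects the absence of condition labels in that dataset.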

## Short description of each section in the development dataset

Short descriptions of each section in the development dataset and the attribute format in the file names of their training data are as follows:

| Machine type | Section | Description | Attribute format in file names of training data |
|---|---|---|---|
| Fan | 00 | Wind strength variations between domains | strength__ambient |
| Fan | 01 | Two products from the same manufacturer with size variations between domains | strength_1__ambient : big or small |
| Fan | 02 | Factory noise variations between domains | Source domain: strength_1_ambient; Target domain: strength_1_Fact_D_ambient |
| Gearbox | 00 | Voltage variations between domains | _g__mm__mV_none |
| Gearbox | 01 | Arm-length variations between domains | _g__mm__mV_none |
| Gearbox | 02 | Weight variations between domains | _g__mm__mV_none |
| Pump | 00 | Submersible pump; viscosity variations between domains | serial_no__ |
| Pump | 01 | SNR variations between domains | serial_no__ |
| Pump | 02 | Multiple pumps running simultaneously in the target domain; anomaly condition indicating an abnormality in one or more of the pumps | serial_no_ |
| Slide rail | 00 | Ball-screw-type slide rail; velocity variations between domains | vel_dis_accl |
| Slide rail | 01 | Ball-screw-type slide rail; operation pattern variations between domains | vel_dis_accl |
| Slide rail | 02 | Belt-type slide rail; belt material variations between domains | _none |
| ToyCar | 00 | Car model variations between domains | SpdVMic |
| ToyCar | 01 | Speed level variations between domains | SpdVMic |
| ToyCar | 02 | Different microphone types and positions between domains | SpdVMic |
| ToyTrain | 00 | Train model variations between domains | SpdMic |
| ToyTrain | 01 | Speed level variations between domains | SpdMic |
| ToyTrain | 02 | Different microphone types and positions between domains | SpdMic |
| Valve | 00 | Open/close operation pattern variations between domains | pattern__ |
| Valve | 01 | No pump running in the source domain | pattern__ |
| Valve | 02 | Two valves running simultaneously in the target domain; anomaly condition indicates a small piece of metal is caught in one of the valves | Source domain: pattern__; Target domain: pattern___comb |

## Short description of each section in the additional training dataset

Short descriptions of each section in the additional training dataset (and the evaluation dataset) and the attribute format in the file names of their data are as follows:

| Machine type | Section | Description | Attribute format in file names of training data |
|---|---|---|---|
| Fan | 03 | Wind strength variations between domains | strength__ambient |
| Fan | 04 | Wind strength variations between domains | strength__ambient |
| Fan | 05 | Heating temperature variations between domains | Source domain: strength_1_temp_min; Target domain: strength_1_temp_max |
| Gearbox | 03 | Voltage variations between domains | _g__mm__mV_none |
| Gearbox | 04 | Arm-length variations between domains | _g__mm__mV_none |
| Gearbox | 05 | Voltage and arm-length variations between domains | _g__mm__mV_none |
| Pump | 03 | Submersible pump; viscosity variations between domains | serial_no__ |
| Pump | 04 | Factory noise variations between domains | serial_no_ |
| Pump | 05 | Multiple pumps running simultaneously in the target domain; anomaly condition indicating an abnormality in one or more of the pumps | serial_no_ |
| Slide rail | 03 | Ball-screw-type slide rail; velocity variations between domains | vel_dis_accl |
| Slide rail | 04 | Ball-screw-type slide rail; operation pattern variations between domains | vel_dis_accl |
| Slide rail | 05 | Belt-type slide rail; belt material variations between domains | _none |
| ToyCar | 03 | Car model variations between domains | SpdVMic |
| ToyCar | 04 | Speed level variations between domains | SpdVMic |
| ToyCar | 05 | Different microphone types and positions between domains | SpdVMic |
| ToyTrain | 03 | Train model variations between domains | SpdMic |
| ToyTrain | 04 | Speed level variations between domains | SpdMic |
| ToyTrain | 05 | Different microphone types and positions between domains | SpdMic |
| Valve | 03 | Open/close operation pattern variations between domains | pattern__ |
| Valve | 04 | No pump running in the source domain; water flowing in the target domain | pattern__ |
| Valve | 05 | Two valves running simultaneously in the target domain; anomaly condition indicates a small piece of metal is caught in one of the valves | Source domain: pattern__; Target domain: pattern___comb |

## External data resources

Based on the previous DCASE external data resource policy, we allow the use of external datasets and trained models under the following conditions:

1. Any test data in both development and evaluation datasets shall not be used for training.
2. Any data in the ToyADMOS dataset, the MIMII dataset, and the datasets of DCASE 2020 Challenge Task 2 shall not be used.
3. Datasets, pre-trained models, and pre-trained parameters on the "List of external data resources allowed" can be used. The list will be updated upon request. Datasets, pre-trained models, and pre-trained parameters, which are freely accessible to any other research group before the 1st of April 2021, can be added to the list.
4. If you want us to add the sources of the external datasets, pre-trained models, and pre-trained parameters to the list, send a request to the organizers by the evaluation dataset’s publishing date. We will update the "list of external data resources allowed" on the web page accordingly to give all competitors an equal opportunity to use them.
5. Once the evaluation set is published, no further external sources will be added. The list will be locked after the 1st of June 2021.

### List of external data resources allowed:

| Name | Type | Date | URL |
|---|---|---|---|
| IDMT-ISA-ELECTRIC-ENGINE | audio | 01.03.2021 | https://www.idmt.fraunhofer.de/en/publications/isa-electric-engine.html |
| VGGish | model | 01.03.2021 | https://github.com/tensorflow/models/tree/master/research/audioset/vggish |
| PANNs | model | 01.03.2021 | https://zenodo.org/record/3576403/ |
| PyTorch Image Models (including tens of pre-trained models) | model | 13.05.2021 | https://github.com/rwightman/pytorch-image-models |
| torchvision.models (including tens of pre-trained models) | model | 13.05.2021 | https://pytorch.org/vision/stable/models.html |

• Development dataset (7.5 GB), version 1.0
• Evaluation dataset (2.2 GB), version 1.0

Participants are required to submit both an anomaly score and a normal/anomaly decision result for each test clip. The anomaly score for each test clip will be used to calculate the area under the receiver operating characteristic (ROC) curve (AUC) and partial-AUC (pAUC) scores, which are used to calculate an official score and a final ranking. The normal/anomaly decision result for each test clip is used to calculate the precision, recall, and F1 scores, which will also be published when the challenge results are open. The method of evaluation is described in the Evaluation section.

The anomaly score takes a large value when the input signal seems to be anomalous, and vice versa. To calculate the anomaly score, participants need to train an anomaly score calculator $$\mathcal{A}$$ with parameter $$\theta$$. The input of $$\mathcal{A}$$ is a machine's operating sound $$x \in \mathbb{R}^{L}$$ and its machine information including machine type, section index, and whether it is in a source or target domain, and $$\mathcal{A}$$ outputs one anomaly score for the whole audio clip $$x$$ as $$\mathcal{A}_\theta (x) \in \mathbb{R}$$. Then, $$x$$ is determined to be anomalous when the anomaly score exceeds a pre-defined threshold value. Thus, $$\mathcal{A}$$ needs to be trained so that $$\mathcal{A}_\theta(x)$$ will be a large value both when the whole audio clip $$x$$ is anomalous and when a part of $$x$$ is anomalous, such as with collision anomalous sounds.

Figure 3 shows the overview of this task, where the example is a procedure for calculating the anomaly scores of the test clips of (fan, section 00, target domain). First, the participants train an anomaly score calculator $$\mathcal{A}$$ using training data both in the source and target domains and optional external data resources. Then, by using $$\mathcal{A}$$, participants calculate the anomaly scores of all the test clips of (fan, section 00, target domain). By repeating this procedure, participants calculate the anomaly score of all the test clips of all the machine types, sections, and domains.

Any number of anomaly score calculators $$\mathcal{A}$$ can be used to calculate the anomaly scores of test clips. The simplest strategy is to use a single $$\mathcal{A}$$ to calculate the anomaly scores for a single section (e.g., section 00) and a single domain (i.e., the source or target domain). In this case, $$\mathcal{A}$$ is specialized to a single section and a single domain, so users of such a system are required to train $$\mathcal{A}$$ for each machine type, product, and condition. A more challenging strategy is to use a single $$\mathcal{A}$$ to calculate the anomaly scores of all the test clips of all the machine types, sections, and both source and target domains. The advantage of this strategy is that participants can use all the training clips provided; however, they need to consider the generalization of the model. Another typical scenario, inspired by real-world applications, is one in which a general model can be pre-trained with source-domain data but must then be adapted to the target domains without the source-domain data. The task organizers do not impose this constraint but would appreciate participants' efforts to impose constraints on themselves based on various real-world applications.

All training data with arbitrary splitting can be used to train an anomaly score calculator. For example, to train $$\mathcal{A}$$ to calculate the anomaly score of (valve, section 03, source domain), participants can opt to use training data only in (valve, section 03, source domain), training data in both the source and target domains, training data of all sections of valves, all provided training data, and/or other strategies. Of course, normal/anomalous clips in test data cannot be used for training; however, simulating anomalous samples using the listed external data resources is allowed.

Changing the model (model/architecture/hyperparameters) between machine types within a single submission is allowed. However, we expect participants to develop a simple ASD system (i.e. keep the model and hyperparameters fixed and only change the training data to adapt to each machine type).

# Submission

The official challenge submission consists of:

• System output for the evaluation data
• Meta information files

System output should be presented as a text-file that corresponds to each machine type, section index, and domain (source/target). Its file name should be:

• Anomaly score file: anomaly_score_<machine_type>_section_<section_index>_<domain>.csv
• Detection result file: decision_result_<machine_type>_section_<section_index>_<domain>.csv

The anomaly score file (in CSV format, without header row) contains the anomaly score for each audio file in the test data of the evaluation dataset. Result items can be in any order. All rows must be in the following format:

[filename (string)],[anomaly score (real value)]


Anomaly scores in the second column can take a negative value. For example, typical auto-encoder-based anomaly score calculators use the squared reconstruction error, which takes a non-negative value, while statistical model-based methods (such as GMM) use the negative log-likelihood as the anomaly score, which can take both positive and negative values.

The decision result file (in CSV format, without header row) contains the normal/anomaly decision result for each audio file in the test data of the evaluation dataset. Result items can be in any order. All rows must be in the following format:

[filename (string)],[decision result (0: normal, 1: anomaly)]
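The two file formats above can be produced with the standard `csv` module. This is a minimal sketch rather than the official submission tooling: the helper name `write_submission_files`, its arguments, and the use of a single fixed threshold for the decision are assumptions made for illustration.

```python
import csv
import os


def write_submission_files(out_dir, machine_type, section_index, domain, scores, threshold):
    """Write the anomaly score file and the decision result file for one section and domain.

    `scores` maps each test file name to its anomaly score (hypothetical input format).
    """
    suffix = f"{machine_type}_section_{section_index}_{domain}.csv"
    score_path = os.path.join(out_dir, "anomaly_score_" + suffix)
    decision_path = os.path.join(out_dir, "decision_result_" + suffix)

    # No header row: each row is "<filename>,<anomaly score>".
    with open(score_path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, score in scores.items():
            writer.writerow([name, score])

    # No header row: each row is "<filename>,<0 or 1>" (1 means anomaly).
    with open(decision_path, "w", newline="") as f:
        writer = csv.writer(f)
        for name, score in scores.items():
            writer.writerow([name, 1 if score > threshold else 0])

    return score_path, decision_path
```

A call such as `write_submission_files("out", "fan", "03", "source", scores, threshold)` would create `anomaly_score_fan_section_03_source.csv` and `decision_result_fan_section_03_source.csv` in an existing `out` directory.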


We allow up to four system output submissions per participant/team. For each system, meta information should be provided in a separate file that contains the task-specific information. All files should be packaged into a zip file for submission. Detailed information on the submission process can be found on the Submission page.

# Evaluation

## Metrics

This task is evaluated with the AUC and the pAUC. The pAUC is an AUC calculated from a portion of the ROC curve over the pre-specified range of interest. In our metric, the pAUC is calculated as the AUC over a low false-positive-rate (FPR) range $$[0, p]$$. The AUC and pAUC for each machine type, section, and domain are defined as

$${\rm AUC}_{m, n, d} = \frac{1}{N_{-}N_{+}} \sum_{i=1}^{N_{-}} \sum_{j=1}^{N_{+}} \mathcal{H} (\mathcal{A}_{\theta} (x_{j}^{+}) - \mathcal{A}_{\theta} (x_{i}^{-})),$$
$${\rm pAUC}_{m, n, d} = \frac{1}{\lfloor p N_{-} \rfloor N_{+}} \sum_{i=1}^{\lfloor p N_{-} \rfloor} \sum_{j=1}^{N_{+}} \mathcal{H} (\mathcal{A}_{\theta} (x_{j}^{+}) - \mathcal{A}_{\theta} (x_{i}^{-}))$$

where $$m$$ represents the index of a machine type, $$n$$ represents the index of a section, $$d \in \{ {\rm source}, {\rm target} \}$$ represents a domain, $$\lfloor \cdot \rfloor$$ is the flooring function, and $$\mathcal{H} (x)$$ returns 1 when $$x > 0$$ and 0 otherwise. Here, $$\{x_{i}^{-}\}_{i=1}^{N_{-}}$$ and $$\{x_{j}^{+}\}_{j=1}^{N_{+}}$$ are the normal and anomalous test clips in the domain $$d$$ in the section $$n$$ in the machine type $$m$$, respectively, and they have been sorted so that their anomaly scores are in descending order. Here, $$N_{-}$$ and $$N_{+}$$ are the numbers of normal and anomalous test clips in the domain $$d$$ in the section $$n$$ in the machine type $$m$$, respectively.

The additional use of the pAUC is based on practical requirements. If an ASD system frequently gives false alarms, we cannot trust it. Therefore, it is especially important to increase the true-positive rate under low-FPR conditions. In this task, we will use $$p=0.1$$.
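The two definitions can be evaluated directly from lists of anomaly scores. The sketch below follows the formulas as written in this section (pairwise comparison with a strict inequality, and the top $$\lfloor p N_{-} \rfloor$$ normal scores for the pAUC); the function names are hypothetical, and an official scoring script may differ in details such as tie handling or normalization.

```python
import math


def auc(normal_scores, anomaly_scores):
    """Fraction of (normal, anomalous) pairs in which the anomalous clip scores strictly higher."""
    hits = sum(1 for n in normal_scores for a in anomaly_scores if a > n)
    return hits / (len(normal_scores) * len(anomaly_scores))


def pauc(normal_scores, anomaly_scores, p=0.1):
    """AUC restricted to the floor(p * N_minus) highest-scoring normal clips."""
    k = math.floor(p * len(normal_scores))
    if k == 0:
        raise ValueError("too few normal clips for the requested p")
    # Normal clips sorted by descending anomaly score, as in the definition.
    top_normals = sorted(normal_scores, reverse=True)[:k]
    return auc(top_normals, anomaly_scores)
```

With ten normal scores 0.1, 0.2, ..., 1.0 and two anomalous scores 0.95 and 2.0, the AUC is 19/20 = 0.95, while the pAUC with $$p = 0.1$$ compares both anomalous clips against only the highest normal score (1.0) and gives 0.5.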

The official score $$\Omega$$ for each submitted system is given by the harmonic mean of the AUC and pAUC scores over all the machine types and all the sections as follows:

$$\Omega = h \left\{ {\rm AUC}_{m, n, d}, \ {\rm pAUC}_{m, n, d} \quad | \quad m \in \mathcal{M}, \ n \in \mathcal{S}(m), \ d \in \{ {\rm source}, {\rm target} \} \right\},$$

where $$h\left\{\cdot\right\}$$ represents the harmonic mean (over all machine types, sections, and domains), $$\mathcal{M}$$ represents the set of the machine types, and $$\mathcal{S}(m)$$ represents the set of the sections for the machine type $$m$$.
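As a small illustration of the official score, the harmonic mean can be taken over every AUC and pAUC value at once. This is a sketch under the assumption that results are held in a dictionary keyed by (machine type, section, domain); that container and the name `official_score` are hypothetical.

```python
from statistics import harmonic_mean


def official_score(results):
    """Harmonic mean of all AUC and pAUC values.

    `results` maps (machine_type, section, domain) to a pair (auc, pauc).
    """
    values = []
    for auc_value, pauc_value in results.values():
        values.extend([auc_value, pauc_value])
    return harmonic_mean(values)
```

Because the harmonic mean is dominated by its smallest terms, a system that performs poorly on even one machine type, section, or domain is penalized more strongly than it would be under an arithmetic mean.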

As the equations above show, no threshold value needs to be determined to calculate the AUC, pAUC, or official score, because the anomaly scores of the normal test clips act as the thresholds. However, in real applications, a threshold value must be determined in order to decide whether each clip is normal or anomalous. Therefore, participants are also required to submit the normal/anomaly decision results. The organizers will publish the AUC, pAUC, and official scores as well as the precision, recall, and F1 scores calculated for the normal/anomaly decision results.
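The precision, recall, and F1 score of a set of decision results can be computed as in the following sketch. It treats "anomaly" (label 1) as the positive class and returns 0.0 when a denominator is zero; both conventions are assumptions, and the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 with anomaly (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```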

Note: The submitted normal/anomaly decision results will not be used for the final ranking because the task organizers do not want to encourage participants to use a forbidden approach (i.e., threshold tuning based on the distribution in the evaluation dataset). Do not use other test clips to determine anomalies for each test clip.

## Ranking

The final ranking will be decided by sorting based on the official score $$\Omega$$.

# Results

Complete results and technical reports can be found on the results page.

# Baseline system

The task organizers provide two baseline systems. They offer a simple entry-level approach that gives reasonable performance on the dataset of Task 2, and they are good starting points, especially for entry-level researchers who want to get familiar with the ASD task.

## Autoencoder-based baseline

This is a simple autoencoder (AE)-based anomaly score calculator, the same as the baseline of DCASE 2020 Task 2. The anomaly score is calculated as the reconstruction error of the observed sound. To obtain small anomaly scores for normal sounds, the AE is trained to minimize the reconstruction error of the normal training data. This method is based on the assumption that the AE cannot reconstruct sounds that are not used in training, that is, unknown anomalous sounds.

In the baseline system, we first calculate the log-mel-spectrogram of the input $$X = \{X_t\}_{t = 1}^T$$ where $$X_t \in \mathbb{R}^F$$, and $$F$$ and $$T$$ are the number of mel-filters and time-frames, respectively. Then, the acoustic feature at $$t$$ is obtained by concatenating consecutive frames of the log-mel-spectrogram as $$\psi_t = (X_t, \cdots, X_{t + P - 1}) \in \mathbb{R}^D$$, where $$D = P \times F$$, and $$P$$ is the number of frames of the context window. The anomaly score is calculated as:

$$A_{\theta}(X) = \frac{1}{DT} \sum_{t = 1}^T \| \psi_t - r_{\theta}(\psi_t) \|_{2}^{2},$$

where $$r_{\theta}$$ is the vector reconstructed by the autoencoder, and $$\| \cdot \|_2$$ is $$\ell_2$$ norm.
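The feature construction and the score can be illustrated without any deep-learning framework. In this minimal sketch, `reconstruct` is a hypothetical stand-in for the trained autoencoder, and the average is taken over the context vectors that actually fit in the clip (there are $$T - P + 1$$ of them), which is an assumption about how the sum over $$t$$ is realized.

```python
def context_vectors(frames, P):
    """Concatenate P consecutive log-mel frames (each of length F) into vectors of length P * F."""
    return [
        [value for frame in frames[t:t + P] for value in frame]
        for t in range(len(frames) - P + 1)
    ]


def ae_anomaly_score(frames, reconstruct, P=5):
    """Mean squared reconstruction error over all context vectors and dimensions."""
    vectors = context_vectors(frames, P)
    D = len(vectors[0])
    total = 0.0
    for psi in vectors:
        rec = reconstruct(psi)  # stand-in for the autoencoder output
        total += sum((a - b) ** 2 for a, b in zip(psi, rec))
    return total / (D * len(vectors))
```

For example, a `reconstruct` function that returns its input unchanged yields a score of 0.0, while one that is off by 1.0 in every dimension yields 1.0.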

To determine the anomaly detection threshold, we assume that $$A_{\theta}$$ follows a gamma distribution. The parameters of the gamma distribution are estimated from the histogram of $$A_{\theta}$$, and the anomaly detection threshold is determined as the 90th percentile of the gamma distribution. If $$A_{\theta}$$ for each test clip is greater than this threshold, the clip is judged to be abnormal; if it is smaller, it is judged to be normal.
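The thresholding step can be sketched in pure Python as follows. This is not the baseline's implementation: it fits the gamma distribution by the method of moments (the text does not say which estimator is used), it assumes the scores come from normal clips and are strictly positive, and all function names are hypothetical. The 90th percentile is found by bisection on the gamma CDF, which is evaluated with a standard series expansion.

```python
import math


def regularized_lower_gamma(shape, x):
    """CDF at x of a gamma distribution with the given shape and unit scale (series expansion)."""
    if x <= 0.0:
        return 0.0
    denom = shape
    term = 1.0 / shape
    total = term
    for _ in range(100000):
        denom += 1.0
        term *= x / denom
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))


def gamma_ppf(q, shape, scale):
    """Quantile of a gamma distribution, found by bisection on the CDF."""
    lo, hi = 0.0, max(shape, 1.0)
    while regularized_lower_gamma(shape, hi) < q:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if regularized_lower_gamma(shape, mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) * scale


def gamma_threshold(scores, q=0.9):
    """Method-of-moments gamma fit (an assumption), then the q-th quantile as the threshold."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    shape = mean * mean / var
    scale = var / mean
    return gamma_ppf(q, shape, scale)
```

A clip would then be judged anomalous when its score exceeds `gamma_threshold(normal_scores)`. With a shape of 1 and a scale of 1 the distribution is exponential, so `gamma_ppf(0.9, 1.0, 1.0)` is approximately `-ln(0.1)`, about 2.3026.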

### Parameters

#### Acoustic features

• The frame size for STFT is 64 ms (50 % hop size)
• Log-mel energies for 128 ($$= F$$) bands
• 5 ($$= P$$) consecutive frames are concatenated.
• 640 ($$= D = P \times F$$) dimensions are input to the autoencoder.

#### Network Architecture

• Input shape: 640
• Architecture:
  • Dense layer #1
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #2
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #3
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #4
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Bottleneck layer
    • Dense layer (units: 8)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #5
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #6
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #7
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Dense layer #8
    • Dense layer (units: 128)
    • Batch Normalization
    • Activation (ReLU)
  • Output layer
    • Dense layer (units: 640)
• Learning (epochs: 100, batch size: 512, data shuffling between epochs)
• Optimizer: Adam (learning rate: 0.001)

## MobileNetV2-based baseline

This is an anomaly score calculator based on machine identification. In the 2020 version, many teams used a similar approach; in particular, many of them combined MobileNetV2 with metric learning such as ArcFace. With that in mind, the organizers provide a pure MobileNetV2-based baseline this year.

This baseline identifies from which section the observed signal was generated. In other words, it outputs the softmax value that is the predicted probability for each section. The anomaly score is calculated as the averaged negative logit of the predicted probabilities for the correct section.

We first calculate the log-mel-spectrogram of the input $$X = \{X_t\}_{t = 1}^T$$ where $$X_t \in \mathbb{R}^F$$, and $$F$$ and $$T$$ are the number of mel-filters and time-frames, respectively. Then, the acoustic feature (two-dimensional image) at $$t$$ is obtained by concatenating consecutive frames of the log-mel-spectrogram as $$\psi_t = (X_t, \cdots, X_{t + P - 1}) \in \mathbb{R}^{P \times F}$$, where $$P$$ is the number of frames of the context window. By shifting the context window by $$L$$ frames, $$B (= \lfloor \frac{T - P}{L} \rfloor)$$ images are extracted. The anomaly score is calculated as:

$$A_{\theta}(X) = \frac{1}{B} \sum_{b = 1}^B \log \left\{ \frac{1 - p_{\theta}(\psi_{t(b)})}{p_{\theta}(\psi_{t(b)})} \right\},$$

where $$t(b)$$ is the beginning frame index of the $$b$$-th image, and $$p_{\theta}$$ is the softmax output by MobileNetV2 for the correct section.
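The score above is simple to compute once the per-image softmax outputs are available. The following sketch assumes that `probs` already holds $$p_{\theta}(\psi_{t(b)})$$ for the correct section of each of the $$B$$ images; the function names and the small clipping constant `eps` (added only to avoid taking the logarithm of zero) are hypothetical.

```python
import math


def num_images(T, P=64, L=8):
    """Number of images B = floor((T - P) / L), as given in the text."""
    return (T - P) // L


def mobilenet_anomaly_score(probs, eps=1e-12):
    """Average negative logit of the softmax probability for the correct section."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # clipping is an assumption for numerical safety
        total += math.log((1.0 - p) / p)
    return total / len(probs)
```

When the network is confident about the correct section (probabilities close to 1), the score is strongly negative; as the probabilities fall towards 0, the score grows, which is the behaviour expected of an anomaly score.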

To determine the anomaly detection threshold, we assume that $$A_{\theta}$$ follows a gamma distribution. The parameters of the gamma distribution are estimated from the histogram of $$A_{\theta}$$, and the anomaly detection threshold is determined as the 90th percentile of the gamma distribution. If $$A_{\theta}$$ for each test clip is greater than this threshold, the clip is judged to be abnormal; if it is smaller, it is judged to be normal.

### Parameters

#### Acoustic features

• The frame size for STFT is 64 ms (50 % hop size)
• Log-mel energies for 128 ($$= F$$) bands
• 64 ($$= P$$) consecutive frames are concatenated.
• Images of size $$P \times F$$ are input to a network using MobileNetV2.
• The context window is shifted by 8 ($$= L$$) frames (called "hop frames" or "stride").

#### Network Architecture

• Input shape: $$64 \times 128$$ image
• Architecture:
  • Triplication layer
    • Triplication of the input image to each color channel
  • MobileNetV2
    • Input: $$64 \times 128 \times 3$$ image
    • Output: Softmax for 3 sections (Section 00, 01, and 02)
• Learning (epochs: 20, batch size: 32, data shuffling between epochs)
• Optimizer: Adam (learning rate: 0.00001)

## Repository

Detailed information can be found in the GitHub repository. The directory structure is briefly described here as a reference for label information. When you unzip the files downloaded from the GitHub repository and Zenodo, you can see the following directory structure. As described in the Dataset section, the machine type information is given by directory name, and the section index, domain, and the condition information are given by file name, as:

• /00_train.py
• /01_test.py
• /common.py
• /keras_model.py
• /baseline.yaml
• /00_train.py
• /01_test.py
• /common.py
• /keras_model.py
• /baseline.yaml
• /dev_data
  • /fan
    • /train (only normal clips)
      • /section_00_source_train_normal_0000_.wav
      • ...
      • /section_00_source_train_normal_0999_.wav
      • /section_00_target_train_normal_0000_.wav
      • /section_00_target_train_normal_0001_.wav
      • /section_00_target_train_normal_0002_.wav
      • /section_01_source_train_normal_0000_.wav
      • ...
      • /section_02_target_train_normal_0999_.wav
    • /source_test
      • /section_00_source_test_normal_0000.wav
      • ...
      • /section_00_source_test_normal_0099.wav
      • /section_00_source_test_anomaly_0000.wav
      • ...
      • /section_00_source_test_anomaly_0099.wav
      • /section_01_source_test_normal_0000.wav
      • ...
      • /section_02_source_test_anomaly_0099.wav
    • /target_test
      • /section_00_target_test_normal_0000.wav
      • ...
      • /section_00_target_test_normal_0099.wav
      • /section_00_target_test_anomaly_0000.wav
      • ...
      • /section_00_target_test_anomaly_0099.wav
      • /section_01_target_test_normal_0000.wav
      • ...
      • /section_02_target_test_anomaly_0099.wav
  • /gearbox (The other machine types have the same directory structure as fan.)
  • /pump
  • /slider (slider means "slide rail")
  • /ToyCar
  • /ToyTrain
  • /valve
• /eval_data
  • /fan
    • /train (after launch of the additional training dataset)
      • /section_03_source_train_normal_0000_.wav
      • ...
      • /section_03_source_train_normal_0999_.wav
      • /section_03_target_train_normal_0000_.wav
      • /section_03_target_train_normal_0001_.wav
      • /section_03_target_train_normal_0002_.wav
      • /section_04_source_train_normal_0000_.wav
      • ...
      • /section_05_target_train_normal_0999_.wav
    • /source_test (after launch of the evaluation dataset)
      • /section_03_source_test_0000.wav
      • ...
      • /section_03_source_test_0199.wav
      • /section_04_source_test_0000.wav
      • ...
      • /section_05_source_test_0199.wav
    • /target_test (after launch of the evaluation dataset)
      • /section_03_target_test_0000.wav
      • ...
      • /section_03_target_test_0199.wav
      • /section_04_target_test_0000.wav
      • ...
      • /section_05_target_test_0199.wav
  • /gearbox (The other machine types have the same directory structure as fan.)
  • /pump
  • /slider (slider means "slide rail")
  • /ToyCar
  • /ToyTrain
  • /valve
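Because the section index, domain, and condition are encoded in the file names listed above, they can be recovered with a small parser. The sketch below is not part of the baseline code. The names `ClipInfo` and `parse_clip_name` are hypothetical, and the pattern is an assumption derived only from the file names in this listing: it tolerates anything after the four-digit clip index and a missing condition label, as in the evaluation test clips.

```python
import re
from typing import NamedTuple, Optional


class ClipInfo(NamedTuple):
    section: int
    domain: str               # "source" or "target"
    split: str                # "train" or "test"
    condition: Optional[str]  # "normal", "anomaly", or None if not given
    index: int


# Pattern inferred from the listing; the trailing part after the index is ignored.
_CLIP_NAME = re.compile(
    r"section_(\d{2})_(source|target)_(train|test)"
    r"(?:_(normal|anomaly))?_(\d{4})(?:_.*)?\.wav"
)


def parse_clip_name(file_name):
    """Parse a bare file name (no directory part) into its label fields."""
    match = _CLIP_NAME.fullmatch(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    section, domain, split, condition, index = match.groups()
    return ClipInfo(int(section), domain, split, condition, int(index))


dev_clip = parse_clip_name("section_02_target_test_anomaly_0099.wav")
eval_clip = parse_clip_name("section_05_target_test_0199.wav")
```

The machine type is not part of the file name, so it would have to be read from the enclosing directory name.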

After you run the training script 00_train.py and the test script 01_test.py, two CSV files are stored in the directory result/ for each section index and domain: one lists the anomaly score of each clip, and the other lists the normal/anomaly decision for each clip. More detailed information is available in the GitHub repository.

## Results with the development dataset

We evaluated the AUC and pAUC on the development dataset using several types of GPUs (RTX 2080, etc.). Because the results produced with a GPU are generally non-deterministic, the averages and standard deviations over five independent trials (training and testing) are shown in the following tables.

### Detailed results for AE-based baseline

| Machine type | Metric | Section 0 (Source) | Section 1 (Source) | Section 2 (Source) | Section 0 (Target) | Section 1 (Target) | Section 2 (Target) | Arithmetic mean | Harmonic mean |
|---|---|---|---|---|---|---|---|---|---|
| ToyCar | AUC (Ave.) | 67.63 % | 61.97 % | 74.36 % | 54.50 % | 64.12 % | 56.57 % | 63.19 % | 62.49 % |
| ToyCar | AUC (Std.) | 1.21 % | 1.50 % | 0.82 % | 0.89 % | 1.07 % | 1.53 % | 0.80 % | 0.81 % |
| ToyCar | pAUC (Ave.) | 51.87 % | 51.82 % | 55.56 % | 50.52 % | 52.14 % | 52.61 % | 52.42 % | 52.36 % |
| ToyCar | pAUC (Std.) | 0.50 % | 0.87 % | 0.83 % | 0.20 % | 0.80 % | 1.20 % | 0.25 % | 0.26 % |
| ToyTrain | AUC (Ave.) | 72.67 % | 72.65 % | 69.91 % | 56.07 % | 51.13 % | 55.57 % | 63.00 % | 61.71 % |
| ToyTrain | AUC (Std.) | 1.19 % | 0.32 % | 0.33 % | 0.80 % | 0.53 % | 1.07 % | 0.41 % | 0.44 % |
| ToyTrain | pAUC (Ave.) | 69.38 % | 62.52 % | 47.48 % | 50.62 % | 48.60 % | 50.79 % | 54.90 % | 53.81 % |
| ToyTrain | pAUC (Std.) | 1.06 % | 0.88 % | 0.02 % | 0.68 % | 0.13 % | 0.93 % | 0.47 % | 0.40 % |
| fan | AUC (Ave.) | 66.69 % | 67.43 % | 64.21 % | 69.70 % | 49.99 % | 66.19 % | 64.03 % | 63.24 % |
| fan | AUC (Std.) | 0.81 % | 1.12 % | 1.27 % | 0.32 % | 0.48 % | 1.23 % | 0.35 % | 0.36 % |
| fan | pAUC (Ave.) | 57.08 % | 50.72 % | 53.12 % | 55.13 % | 48.49 % | 56.93 % | 53.58 % | 53.38 % |
| fan | pAUC (Std.) | 0.15 % | 0.42 % | 0.78 % | 0.34 % | 0.38 % | 1.37 % | 0.19 % | 0.21 % |
| gearbox | AUC (Ave.) | 56.03 % | 72.77 % | 58.96 % | 74.29 % | 72.12 % | 66.41 % | 66.76 % | 65.97 % |
| gearbox | AUC (Std.) | 0.53 % | 0.72 % | 0.53 % | 0.51 % | 1.06 % | 0.72 % | 0.36 % | 0.30 % |
| gearbox | pAUC (Ave.) | 51.59 % | 52.30 % | 51.82 % | 55.67 % | 51.78 % | 53.66 % | 52.80 % | 52.76 % |
| gearbox | pAUC (Std.) | 0.16 % | 0.18 % | 0.29 % | 0.97 % | 0.15 % | 0.57 % | 0.28 % | 0.26 % |
| pump | AUC (Ave.) | 67.48 % | 82.38 % | 63.93 % | 58.01 % | 47.35 % | 62.78 % | 63.66 % | 61.92 % |
| pump | AUC (Std.) | 0.58 % | 0.27 % | 0.45 % | 0.57 % | 0.53 % | 0.70 % | 0.19 % | 0.18 % |
| pump | pAUC (Ave.) | 61.83 % | 58.29 % | 55.44 % | 51.53 % | 49.65 % | 51.67 % | 54.74 % | 54.41 % |
| pump | pAUC (Std.) | 0.41 % | 0.77 % | 0.52 % | 0.27 % | 1.46 % | 0.35 % | 0.43 % | 0.46 % |
| slide rail | AUC (Ave.) | 74.09 % | 82.16 % | 78.34 % | 67.22 % | 66.94 % | 46.20 % | 69.16 % | 66.74 % |
| slide rail | AUC (Std.) | 0.48 % | 0.35 % | 0.16 % | 0.45 % | 0.39 % | 0.77 % | 0.36 % | 0.44 % |
| slide rail | pAUC (Ave.) | 52.45 % | 60.29 % | 65.16 % | 57.32 % | 53.08 % | 50.10 % | 56.40 % | 55.94 % |
| slide rail | pAUC (Std.) | 0.63 % | 0.30 % | 0.55 % | 0.52 % | 0.39 % | 0.31 % | 0.16 % | 0.17 % |
| valve | AUC (Ave.) | 50.34 % | 53.52 % | 59.91 % | 47.12 % | 56.39 % | 55.16 % | 53.74 % | 53.41 % |
| valve | AUC (Std.) | 0.27 % | 0.33 % | 0.34 % | 0.18 % | 1.42 % | 0.22 % | 0.25 % | 0.22 % |
| valve | pAUC (Ave.) | 50.82 % | 49.33 % | 51.96 % | 48.68 % | 53.88 % | 48.97 % | 50.61 % | 50.54 % |
| valve | pAUC (Std.) | 0.16 % | 0.10 % | 0.52 % | 0.09 % | 0.61 % | 0.04 % | 0.11 % | 0.10 % |

### Detailed results for MobileNetV2-based baseline

| Machine type | Metric | Section 0 (Source) | Section 1 (Source) | Section 2 (Source) | Section 0 (Target) | Section 1 (Target) | Section 2 (Target) | Arithmetic mean | Harmonic mean |
|---|---|---|---|---|---|---|---|---|---|
| ToyCar | AUC (Ave.) | 66.56 % | 71.58 % | 40.37 % | 61.32 % | 72.48 % | 45.17 % | 59.58 % | 56.04 % |
| ToyCar | AUC (Std.) | 2.68 % | 5.54 % | 7.19 % | 5.94 % | 3.68 % | 3.36 % | 1.49 % | 2.25 % |
| ToyCar | pAUC (Ave.) | 66.47 % | 66.44 % | 47.48 % | 52.61 % | 63.99 % | 48.85 % | 57.64 % | 56.37 % |
| ToyCar | pAUC (Std.) | 5.67 % | 2.84 % | 0.23 % | 2.41 % | 2.60 % | 0.94 % | 1.13 % | 0.87 % |
| ToyTrain | AUC (Ave.) | 69.84 % | 64.79 % | 69.28 % | 46.28 % | 53.38 % | 51.42 % | 59.16 % | 57.46 % |
| ToyTrain | AUC (Std.) | 4.39 % | 3.65 % | 6.73 % | 3.85 % | 2.47 % | 2.64 % | 1.02 % | 1.27 % |
| ToyTrain | pAUC (Ave.) | 54.43 % | 54.09 % | 47.66 % | 51.27 % | 49.60 % | 53.40 % | 51.74 % | 51.61 % |
| ToyTrain | pAUC (Std.) | 1.65 % | 1.15 % | 0.40 % | 0.73 % | 0.88 % | 1.12 % | 0.53 % | 0.51 % |
| fan | AUC (Ave.) | 43.62 % | 78.33 % | 74.21 % | 53.34 % | 78.12 % | 60.35 % | 64.66 % | 61.56 % |
| fan | AUC (Std.) | 2.35 % | 1.52 % | 3.85 % | 2.03 % | 4.25 % | 3.79 % | 1.25 % | 1.45 % |
| fan | pAUC (Ave.) | 50.45 % | 78.37 % | 76.80 % | 56.01 % | 66.41 % | 60.97 % | 64.84 % | 63.02 % |
| fan | pAUC (Std.) | 1.15 % | 2.26 % | 0.78 % | 1.38 % | 7.16 % | 6.55 % | 2.36 % | 2.18 % |
| gearbox | AUC (Ave.) | 81.35 % | 60.74 % | 71.58 % | 75.02 % | 56.27 % | 64.45 % | 68.24 % | 66.70 % |
| gearbox | AUC (Std.) | 1.59 % | 5.11 % | 7.16 % | 2.92 % | 8.27 % | 9.67 % | 4.25 % | 4.97 % |
| gearbox | pAUC (Ave.) | 70.46 % | 53.88 % | 62.23 % | 64.77 % | 53.30 % | 55.58 % | 60.03 % | 59.16 % |
| gearbox | pAUC (Std.) | 3.67 % | 2.82 % | 6.67 % | 2.52 % | 2.97 % | 7.90 % | 3.18 % | 3.10 % |
| pump | AUC (Ave.) | 64.09 % | 86.27 % | 53.70 % | 59.09 % | 71.86 % | 50.16 % | 64.20 % | 61.89 % |
| pump | AUC (Std.) | 4.34 % | 3.18 % | 4.99 % | 3.08 % | 5.97 % | 3.78 % | 2.54 % | 2.51 % |
| pump | pAUC (Ave.) | 62.40 % | 66.66 % | 50.98 % | 53.96 % | 62.69 % | 51.69 % | 58.06 % | 57.37 % |
| pump | pAUC (Std.) | 1.90 % | 5.23 % | 1.23 % | 0.93 % | 2.33 % | 1.03 % | 1.40 % | 1.17 % |
| slide rail | AUC (Ave.) | 61.51 % | 79.97 % | 79.86 % | 51.96 % | 46.83 % | 55.61 % | 62.62 % | 59.26 % |
| slide rail | AUC (Std.) | 4.92 % | 3.70 % | 1.41 % | 3.17 % | 10.65 % | 5.48 % | 1.39 % | 2.04 % |
| slide rail | pAUC (Ave.) | 53.97 % | 55.62 % | 71.88 % | 51.96 % | 52.02 % | 55.71 % | 56.86 % | 56.00 % |
| slide rail | pAUC (Std.) | 2.03 % | 1.57 % | 4.64 % | 2.96 % | 4.17 % | 2.84 % | 1.10 % | 0.86 % |
| valve | AUC (Ave.) | 58.34 % | 53.57 % | 56.13 % | 52.19 % | 68.59 % | 53.58 % | 57.07 % | 56.51 % |
| valve | AUC (Std.) | 4.01 % | 2.26 % | 1.96 % | 3.33 % | 2.84 % | 0.55 % | 1.77 % | 1.70 % |
| valve | pAUC (Ave.) | 54.97 % | 50.09 % | 51.69 % | 51.54 % | 57.83 % | 50.86 % | 52.83 % | 52.64 % |
| valve | pAUC (Std.) | 4.43 % | 0.45 % | 0.32 % | 1.88 % | 2.49 % | 0.84 % | 1.33 % | 1.22 % |
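The "Arithmetic mean" and "Harmonic mean" columns aggregate the six section/domain values of a row. As a rough check, the snippet below recomputes both from the rounded ToyCar AUC averages of the AE-based baseline. Recomputing from rounded per-section averages is an assumption about how the columns were produced, so the last digit may differ slightly from the reported values.

```python
import statistics

# AUC (Ave.) in percent for ToyCar, AE-based baseline:
# sections 0-2 (source domain) followed by sections 0-2 (target domain).
toycar_ae_auc = [67.63, 61.97, 74.36, 54.50, 64.12, 56.57]

arithmetic = statistics.fmean(toycar_ae_auc)
harmonic = statistics.harmonic_mean(toycar_ae_auc)

print(f"arithmetic mean: {arithmetic:.2f} %")
print(f"harmonic mean:   {harmonic:.2f} %")
```

The harmonic mean is never larger than the arithmetic mean and is pulled down more strongly by low values, so it penalizes a system that performs poorly on any single section or domain.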

# Citation

If you participate in this task or use the MIMII DUE dataset, the ToyADMOS2 dataset, or the baseline code, please cite the following papers:

Ryo Tanabe, Harsh Purohit, Kota Dohi, Takashi Endo, Yuki Nikaido, Toshiki Nakamura, and Yohei Kawaguchi. MIMII DUE: sound dataset for malfunctioning industrial machine investigation and inspection with domain shifts due to changes in operational and environmental conditions. In arXiv e-prints: 2105.02702, 1–4, 2021.

Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. arXiv preprint arXiv:2106.02369, 2021.

Yohei Kawaguchi, Keisuke Imoto, Yuma Koizumi, Noboru Harada, Daisuke Niizumi, Kota Dohi, Ryo Tanabe, Harsh Purohit, and Takashi Endo. Description and discussion on DCASE 2021 challenge task 2: unsupervised anomalous sound detection for machine condition monitoring under domain shifted conditions. In arXiv e-prints: 2106.04492, 1–5, 2021.