Creating the Training Data Set

To train your model to detect anomalies, you need to teach it what constitutes an anomaly for your set of properties. Your training data consists of a CSV file where each row lists values for the properties you want to use, along with a boolean indicating whether these values constitute an anomaly.

For this tutorial, we created a training data set from generated historical sensor data where the combinations of the three sensor values are labeled as normal or anomaly. The training data is a CSV file with headers containing the names of the 3 properties. The headers should match the names of the properties defined in the digital twin model. The order of properties does not matter.

The column containing the boolean representing whether the row is an anomaly should be named Anomaly.

The dataset for this tutorial contains the three digital twin state properties Temperature, Friction, and RPM, as well as the value for Anomaly.

Note

For best results in binary classification, the training data set should be large, and contain a balanced set for both classifications (normal and anomaly).

Below is what the training data file looks like:

anomaly02

Note

The training dataset is available on the ScaleOut DigitalTwinSamples GitHub repository.