Anomaly Detection Engine for Linux Logs (ADE)

How ADE analyzes a Linux system to find anomalies

Using unsupervised machine learning, ADE extracts and organizes message data to build a model of behavior for each Linux model group. To analyze a Linux system, ADE uses the model for the model group that contains the system, compares the expected behavior with the actual behavior, and flags the differences as anomalies. ADE analyzes each message within a time slice (interval) to determine how much the behavior of that message differs from what is expected. It then totals the differences for all the messages within the interval and compares this value with the normal value from the model to calculate the interval anomaly score.
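The last step above (totaling per-message differences and comparing the total with what the model considers normal) can be sketched as follows. This is only an illustration: the text does not say how ADE performs the comparison, so the percentile rank against training-interval totals used here is an assumed stand-in, and the names interval_total, percentile_rank, and training_totals are hypothetical.

```python
from bisect import bisect_left


def interval_total(message_scores):
    """Total the per-message difference scores for one interval."""
    return sum(message_scores)


def percentile_rank(total, training_totals):
    """Percentage of training-interval totals strictly below `total`.

    ASSUMPTION: a percentile rank stands in for ADE's real comparison
    against the "normal value from the model", which is not documented here.
    """
    if not training_totals:
        raise ValueError("training_totals must not be empty")
    ordered = sorted(training_totals)
    return 100.0 * bisect_left(ordered, total) / len(ordered)


# Hypothetical totals observed for ten intervals during training.
training_totals = [0.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 8.0]

# Hypothetical per-message difference scores for the current interval.
current_total = interval_total([0.5, 0.0, 3.2])
print(percentile_rank(current_total, training_totals))
```

An interval whose total exceeds every total seen in training ranks at 100 in this sketch, which matches the idea that larger differences from the model produce higher scores.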

Time slice (interval)

To produce meaningful analysis results for a monitored system, ADE divides the log into time slices. Each time slice is called an analysis interval, and its length varies depending on the volume of message traffic. For Linux systems, which tend to produce lower-volume and less consistent message traffic, the default analysis interval length in flowlayout.xml is 60 minutes.
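As a rough illustration of time slicing, the sketch below groups log records into 60-minute analysis intervals. It assumes that intervals are aligned to fixed 60-minute boundaries and that each record is a (timestamp, message id) pair; neither detail is stated in this document, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Default analysis interval length for Linux systems, per the text.
INTERVAL = timedelta(minutes=60)
EPOCH = datetime(1970, 1, 1)


def interval_start(ts, interval=INTERVAL):
    """Floor a naive timestamp to the start of its analysis interval.

    ASSUMPTION: intervals are aligned to multiples of `interval`
    counted from the Unix epoch.
    """
    return EPOCH + ((ts - EPOCH) // interval) * interval


def bucket(records, interval=INTERVAL):
    """Group (timestamp, message_id) records by analysis interval."""
    buckets = defaultdict(list)
    for ts, msg_id in records:
        buckets[interval_start(ts, interval)].append(msg_id)
    return dict(buckets)


# Hypothetical records: two fall in the 10:00 interval, one in 11:00.
records = [
    (datetime(2024, 1, 1, 10, 5), "sshd.login"),
    (datetime(2024, 1, 1, 10, 59), "cron.start"),
    (datetime(2024, 1, 1, 11, 0), "sshd.login"),
]
for start, ids in sorted(bucket(records).items()):
    print(start, ids)
```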

To display and record the results of analysis intervals, ADE produces an analysis snapshot every 10 minutes for each monitored system. Each analysis snapshot is a point-in-time record of the anomaly score for an analysis interval. For example:

Model Group

Because the message traffic on Linux systems can often be relatively light, and because Linux images are typically configured in pools of dynamically activated images, ADE is designed to provide analysis results for Linux systems through the use of model groups. Through model groups, multiple systems contribute to the generation of a single model for the group; the more systems in the group, the more data ADE can use to build the model.
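The benefit of pooling can be seen in a small sketch in which every system in a group contributes its messages to one shared set of counts. The group names, system names, message ids, and the dictionary layout are all hypothetical; this document does not describe how ADE stores group membership.

```python
from collections import Counter

# Hypothetical model groups and their member systems.
model_groups = {
    "web": ["web01", "web02"],
    "db": ["db01"],
}

# Hypothetical message ids collected from each system.
messages_by_system = {
    "web01": ["httpd.start", "httpd.request"],
    "web02": ["httpd.request", "httpd.request"],
    "db01": ["pg.checkpoint"],
}


def pooled_counts(group, groups, messages):
    """Combine message counts from every system in one model group."""
    counts = Counter()
    for system in groups[group]:
        counts.update(messages.get(system, []))
    return counts


print(pooled_counts("web", model_groups, messages_by_system))
```

Adding a third web server to the "web" group would only add to the counts available for that group's model, which is the effect described above.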

Defining model groups and their member systems

A model group is a collection of one or more systems that handle the same type of workload, and thus can be expected to exhibit similar behavior. When considering Linux systems to group together in a single model group, use the following guidelines:

Building a model for a model group

ADE builds one model for a group of Linux systems with similar workloads, and compares current syslog data from each system in the group against that model. To build a robust model of Linux system behavior, ADE generally needs a minimum of 120 days of message data. Analysis can begin, however, as soon as the system data that is available for training meets the criteria for building a valid model.
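A simple way to check the 120-day guideline against collected data is sketched below. It assumes that "days of message data" means distinct calendar days on which at least one message was logged. The criteria for a valid model are not given in this document, so the sketch checks only the 120-day guideline, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Guideline from the text for a robust model.
MIN_TRAINING_DAYS = 120


def training_days(message_dates):
    """Count distinct calendar days that have message data.

    ASSUMPTION: a day counts if at least one message was logged on it.
    """
    return len(set(message_dates))


def meets_robust_guideline(message_dates, minimum=MIN_TRAINING_DAYS):
    """True when the data covers at least `minimum` distinct days."""
    return training_days(message_dates) >= minimum


# Hypothetical data: one message date per day for 90 days.
start = date(2024, 1, 1)
dates = [start + timedelta(days=i) for i in range(90)]
print(training_days(dates), meets_robust_guideline(dates))
```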

Measuring behavior of an interval

ADE provides four measures of how unusual an interval is:
  1. Number of unique message ids
  2. Interval anomaly score
  3. Number of messages not in the model
  4. Number of messages that have not been seen by analyze
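Three of these measures are plain counts and can be sketched with set operations. The sketch assumes that each measure counts distinct message ids rather than occurrences, and that the model and the history of what analyze has already seen are available as sets of ids. The interval anomaly score is left out because its calculation is not specified here. All names are hypothetical.

```python
def interval_measures(interval_ids, model_ids, seen_by_analyze):
    """Compute the count-based measures for one analysis interval.

    ASSUMPTION: each measure counts distinct message ids.
    """
    unique = set(interval_ids)
    return {
        "unique_message_ids": len(unique),
        "not_in_model": len(unique - set(model_ids)),
        "new_to_analyze": len(unique - set(seen_by_analyze)),
    }


# Hypothetical interval: "oom.kill" is in neither the model nor the history.
measures = interval_measures(
    interval_ids=["cron.start", "cron.start", "sshd.login", "oom.kill"],
    model_ids={"cron.start", "sshd.login"},
    seen_by_analyze={"cron.start", "sshd.login", "disk.full"},
)
print(measures)
```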

Number of unique message ids

Interval anomaly scores

The interval anomaly score indicates the difference in current behavior compared to the expected behavior that is reflected in the model. If the analysis interval contains messages that are relatively normal, common messages for that system, ADE assigns a low score to the analysis interval and a low score to the analysis snapshot. For example, suppose that you have analyzed a relatively stable test system. On this test system, various daemons are recycled on a regular basis. This behavioral pattern is reflected in the model that ADE uses for analysis. When a current daemon recycle completes normally, the intervals for daemon recycling receive a low interval anomaly score, because the pattern of messages issued during a successful recycle matches an expected behavior in the model. However, if any unexpected messages are issued during a current daemon recycle, ADE assigns a higher interval anomaly score to those analysis intervals that contain the unexpected or unique messages.

The possible interval anomaly scores are:

0 through 99.4
The analysis interval contains messages and message clusters that match expected behavior, or that differ from it only insignificantly, as defined in the ADE model. A score of 0 is possible because ADE eliminates all expected, in-context messages from its scoring calculation; it indicates an interval that exhibits no difference in behavior compared to the group model. Analysis intervals with scores that are greater than 0 but less than 99.5 contain some rarely seen, unexpected, or out-of-context messages. Generally speaking, scores in this range indicate intervals that differ somewhat from the group model but do not vary significantly from it and do not contain messages of much diagnostic value.
99.5 through 100
Analysis intervals with this score contain rarely seen messages (these messages appear in the model only once or twice), or many messages that are unexpected or issued out of context. This score indicates analysis intervals with more differences from the group model; these intervals can contain messages that might help you diagnose anomalous system behavior. 
101 
Analysis intervals with this score exhibit the most significant differences from the group model; these intervals contain messages that merit investigation. ADE assigns this score to analysis intervals
that contain:

Messages not in the model

In the message traffic for a Linux system, ADE detected one or more messages that are not in the current model that is in use for analysis. These messages might have been issued by the Linux system before, and therefore might have been included in previous models, but are not in the current model. In the Anomaly Detection Engine Interval View, the entry for this type of new message displays one of the following values in the Periodicity Status column:
For values other than NEW, a date and time is displayed in the Last Issued column.

Message new to analyze

In the message traffic for a specific Linux system, ADE detected one or more messages that the system has issued for the first time since the day on which ADE began reporting analysis results for this system. In the Anomaly Detection Engine Interval View, the entry for this type of new message displays the following attributes:

Measuring behavior of a message 

A message anomaly score is created by comparing the patterns of message traffic observed during the interval being analyzed with the expected pattern of message traffic observed during all the intervals that are within the training period.   
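To make the idea of comparing an interval against the training period concrete, the sketch below rates a message by how rarely it appeared across training intervals. This frequency-based rarity is a deliberately simplified stand-in, not ADE's scoring method, which this document does not describe. The function name and sample data are hypothetical.

```python
def message_rarity(msg_id, training_intervals):
    """Fraction of training intervals in which `msg_id` did NOT appear.

    ASSUMPTION: rarity across training intervals is used as a simple
    proxy for a message anomaly score. 0.0 means the message appeared
    in every training interval; 1.0 means it never appeared.
    """
    if not training_intervals:
        raise ValueError("training_intervals must not be empty")
    hits = sum(1 for ids in training_intervals if msg_id in ids)
    return 1.0 - hits / len(training_intervals)


# Hypothetical training period: the set of message ids seen per interval.
training_intervals = [
    {"cron.start", "sshd.login"},
    {"cron.start"},
    {"cron.start", "disk.warn"},
    {"cron.start"},
]

for msg in ("cron.start", "sshd.login", "oom.kill"):
    print(msg, message_rarity(msg, training_intervals))
```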

In summary, through the training process, ADE learns about expected message patterns and stores this information as part of the model for a specific group. ADE uses this model data to determine interval and message scores when it analyzes the data from a Linux system.