Anomaly Detection Engine for Linux Logs (ADE)

Why run ADE

To answer the question: are your systems behaving badly?

Many everyday activities can introduce system anomalies and initiate failures in complex, integrated data centers.

You can use a combination of existing system management tools to determine whether any of these activities is causing one or more systems to behave abnormally, but none can detect every possible combination of change and failure. Even when using these tools, you might have to look through message logs to help solve the problem, and the sheer volume of messages can make this a daunting task.

ADE helps you sift through massive volumes of log data to find the portions of the log that deserve detailed review.

Running ADE

Before you run ADE to detect anomalies in Linux logs, work through the following manual steps to understand the problem.


Basic approach


The basic approach is

Pick data to prime the model


For ADE to create a model that will generate useful analytic results, the data used to prime the model must contain a sufficient number of unique message IDs (message keys). Because ADE uses unsupervised learning, it does not require the user to label either messages or intervals; it does, however, require that the systems being analyzed are “relatively” stable.

Almost any Linux system that is used to support production will be stable enough for ADE to find anomalies.

Determining how to group systems into "model groups"


ADE supports grouping similar systems together when building the model. Here is an example of eleven servers and one way you can assign them to model groups:

Examine a time period to find whether unusual behavior occurred during it (root cause analysis)


Prime the database with Linux logs

To prime the ADE database:

  1. Delete any information left in the database: controldb delete
  2. Identify when the potential anomaly occurred
  3. Load Linux logs from the period immediately before that time: upload -f filename or directoryname
  4. Check whether sufficient data has been loaded: verify model group name

    • If there is sufficient information, proceed to training.
    • Otherwise:
      • If additional logs are available from before the set of logs already loaded, load those additional Linux logs.
      • If no more logs are available, reduce the number of model groups. Using the example above, if the problem is with model group 2, consider combining model group 2 and model group 3.
      • If verify still indicates insufficient information after you have loaded additional logs and simplified the model group structure, try training anyway, but remember that the results may be questionable.
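
The decision procedure in step 4 can be written down as a small function. This is only a sketch of the logic described in the list above, not part of ADE: the names NextStep and next_priming_step and the three boolean inputs are hypothetical, and you would supply their values yourself after reading the output of verify.

```python
from enum import Enum


class NextStep(Enum):
    """Possible actions after running verify (names are illustrative only)."""
    TRAIN = "proceed to training"
    LOAD_MORE_LOGS = "load additional, earlier Linux logs"
    MERGE_MODEL_GROUPS = "reduce the number of model groups"
    TRAIN_WITH_CAUTION = "train anyway; results may be questionable"


def next_priming_step(sufficient: bool,
                      earlier_logs_available: bool,
                      groups_can_be_merged: bool) -> NextStep:
    """Pick the next action after a verify run, following the list above."""
    if sufficient:
        return NextStep.TRAIN
    if earlier_logs_available:
        return NextStep.LOAD_MORE_LOGS
    if groups_can_be_merged:
        return NextStep.MERGE_MODEL_GROUPS
    # Nothing left to load or simplify: training is still possible, with a caveat.
    return NextStep.TRAIN_WITH_CAUTION


if __name__ == "__main__":
    # verify reported insufficient data, no older logs remain, merging is possible
    print(next_priming_step(False, False, True).value)
```

Ordering the checks this way mirrors the text: loading more logs is tried first, merging model groups second, and training on thin data is the last resort.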

Create a model of normal behavior from a set of Linux logs

To create a model of the normal behavior of the Linux systems:

Analyze additional logs to detect anomalies

To analyze the time period for anomalies

Examine the results written to the file system using your favorite web browser

To examine the results for a period, point your web browser at the index.xml file for the time period and system of interest. To select a specific interval for further review

The XSLT provided displays a summary of the period (day).

To examine a specific ten-minute interval, point your web browser at the interval_nnn.xml file for the time period, interval, and system of interest. The XSLT provided displays a summary of the interval.
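
When you know the time of a suspected problem, it can help to work out which ten-minute interval of the day contains it. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and the zero-based numbering is an assumption; it is not confirmed that it matches the nnn in ADE's interval_nnn.xml file names, so check it against your own output files.

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60


def intervals_per_day(interval_minutes: int = 10) -> int:
    """Number of fixed-length intervals in one day (144 for ten minutes)."""
    if interval_minutes <= 0 or MINUTES_PER_DAY % interval_minutes != 0:
        raise ValueError("interval_minutes must evenly divide a 24-hour day")
    return MINUTES_PER_DAY // interval_minutes


def interval_index(timestamp: datetime, interval_minutes: int = 10) -> int:
    """Zero-based index of the interval that contains timestamp within its day."""
    intervals_per_day(interval_minutes)  # validates interval_minutes
    minutes_since_midnight = timestamp.hour * 60 + timestamp.minute
    return minutes_since_midnight // interval_minutes


if __name__ == "__main__":
    # 14:37 is 877 minutes after midnight, so it falls in interval 87 (zero-based)
    print(interval_index(datetime(2020, 1, 1, 14, 37)))
```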

The following samples illustrate how the analysis output is written to files using the defaults specified in setup.props:

Continuously process Linux logs so anomaly information is always available


To set up ADE to provide continuous analysis results, the following steps need to be scheduled to run automatically:

If there are duplicate time periods in the logs, ADE will overlay the existing time period in the database with the time period being added by either upload or analyze.
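
The overlay behavior matters when scheduled jobs reprocess logs that overlap. The sketch below illustrates one reading of it, assuming "overlay" means that the newly added period replaces the stored one rather than being appended to it. The dictionary and the add_period function are hypothetical stand-ins; ADE's actual database layout is not described here.

```python
from typing import Dict, List, Tuple

# Key: (system name, time period label); value: log messages stored for that period.
PeriodKey = Tuple[str, str]


def add_period(database: Dict[PeriodKey, List[str]],
               system: str, period: str, messages: List[str]) -> bool:
    """Store messages for a period; return True if an existing period was overlaid."""
    key = (system, period)
    overlaid = key in database
    database[key] = list(messages)  # replace the stored period, do not append
    return overlaid


if __name__ == "__main__":
    db: Dict[PeriodKey, List[str]] = {}
    add_period(db, "hostA", "2020-01-01", ["first load"])
    # Loading the same period again replaces what was stored the first time.
    print(add_period(db, "hostA", "2020-01-01", ["second load"]))
    print(db[("hostA", "2020-01-01")])
```

Under this reading, rerunning a scheduled job over logs that overlap an earlier run does not double-count messages for the repeated period.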

Prime the database with Linux logs

To prime the ADE database:

Create a model of normal behavior from a set of Linux logs

To create a model of the normal behavior of the Linux systems:

Analyze additional logs to detect anomalies

Routinely analyze the available logs so that anomaly information is available when needed.

Examine the results written to the file system using your favorite web browser

To examine the results for a period, point your web browser at the index.xml file for the time period and system of interest. To select a specific interval for further review

The XSLT provided displays a summary of the period (day).

To examine a specific ten-minute interval, point your web browser at the interval_nnn.xml file for the time period, interval, and system of interest. The XSLT provided displays a summary of the interval.

After the automation has been running for a few days, you will probably want to make sure that it is generating the appropriate results.

Delete results that are no longer needed

After training has run, use standard Linux commands to delete the results that are no longer valuable. For example, you could choose to delete all the ADE results that are older than one year.
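
If you would rather script the cleanup than run Linux commands by hand, the following sketch shows one way to do it in Python. It is an illustration under stated assumptions, not an ADE utility: the function names are hypothetical, the results directory is whatever output location your installation is configured to use, and file age is judged by modification time. The default is a dry run that only lists candidates.

```python
import time
from pathlib import Path
from typing import List

SECONDS_PER_DAY = 86400


def find_old_results(results_dir: Path, max_age_days: int = 365) -> List[Path]:
    """Return files under results_dir last modified more than max_age_days ago."""
    cutoff = time.time() - max_age_days * SECONDS_PER_DAY
    return sorted(
        path for path in results_dir.rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def delete_old_results(results_dir: Path, max_age_days: int = 365,
                       dry_run: bool = True) -> List[Path]:
    """List old result files and, unless dry_run is set, delete them."""
    old_files = find_old_results(results_dir, max_age_days)
    if not dry_run:
        for path in old_files:
            path.unlink()
    return old_files
```

Run it first with the default dry_run=True and review the returned list; pass dry_run=False only once the list contains nothing you still need. Empty directories are left in place.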