Simple Moving Average Monitor

The Simple Moving Average monitor type assumes that the underlying metric will vary up and down and attempts to tolerate more variation in volatile times while tolerating less in calm periods. It does this by comparing the metric to a time-averaged version of itself: a smoother version. The amount of smoothing is a function of how wide the time window is for averaging. If the deviation from the average is less than a given minimum tolerance, or failing that, if the deviation is within some multiple of the standard deviation of the average, then the value is considered healthy. Greater deviations put the health into either the warning or critical state, depending on which of the thresholds has been crossed.

You have control over the window size, minimum tolerance, multipliers, and alert triggering percentages for each such monitor.

This is a good choice if you expect a lot of variation in the metric and there is no hard limit. For more information on the concepts, see, for example, the Simple Moving Average article on Wikipedia.

Consider the image above that shows an SMA monitor for akka_processing_time opened in the editor window. The raw metric data is plotted in blue. The moving average is calculated over a 15 minute time window, making it smoother than the raw data. The size of the averaging window is shown in the legend. The shorter the window, the less the smoothing. The thick translucent white band tracks the moving average. Its thickness is controlled by the tolerance setting. Deviations within the tolerance band are considered healthy.

The pairs of orange and red lines are the warning and critical thresholds. In this case they’re one and two standard deviations from the average respectively. If the minimum tolerance happens to be further out than the standard deviation multiple, the line will follow the minimum tolerance instead. Consider the warning (orange) line. If the raw data falls outside the tolerance band and also beyond the warning line, then this is considered a warning event. The same applies to the critical (red) line. Some of these events are indicated in the diagram.
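The per-sample decision described above can be sketched in Python. This is a minimal illustration, not the Console's actual implementation: the function name classify, its parameters, and the choice of a population standard deviation over the raw samples in the window are all assumptions made for the example.

```python
from statistics import mean, pstdev


def classify(window, value, min_tolerance, warn_mult=1.0, crit_mult=2.0):
    """Classify one sample as 'healthy', 'warning' or 'critical'.

    window        -- raw samples in the averaging window (assumed non-empty)
    value         -- the sample being judged
    min_tolerance -- deviation that is always considered healthy
    warn_mult     -- standard deviation multiplier for the warning line
    crit_mult     -- standard deviation multiplier for the critical line
    """
    avg = mean(window)
    std = pstdev(window)  # assumption: population std of the raw samples
    deviation = abs(value - avg)

    # Each threshold line follows whichever is further out:
    # the minimum tolerance or the multiple of the standard deviation.
    warn_line = max(min_tolerance, warn_mult * std)
    crit_line = max(min_tolerance, crit_mult * std)

    if deviation > crit_line:
        return "critical"
    if deviation > warn_line:
        return "warning"
    return "healthy"
```

With a window of [10, 12, 8, 10, 10] and a small minimum tolerance, a sample of 12 lands between the two lines and counts as a warning event. Raising the minimum tolerance above both multiples makes the same sample healthy.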

Whether or not these deviations can trigger an alert is a function of the Trigger Occurrence setting. In the example given, this value is set to 50%. This means that at least 50% of the samples in the time window must be beyond the threshold before the health state changes.

The health bar shows the health state over time for the monitor. In the example, there is a brief orange bar, indicating a health state of warning. There are several things to note here.

  • The health for the monitor was never considered critical, even though there were events where the metric went outside the critical boundaries. This is because the trigger occurrence was set to 50%. In this case, there were not enough critical events (fewer than 50% in the time window) to warrant considering the health to be critical.
  • The health went to the warning state only after 50% of the health events in the time window were warning.

If the trigger occurrence setting had been set to “once”, the monitor would have been considered warning (or critical) as soon as a single value went beyond the boundary. If it had been set to 100%, then in this case the monitor would have been considered healthy throughout because there was no 15 minute period in which the metric was continuously beyond the boundary.
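The effect of the trigger occurrence setting can be sketched as follows. The function window_triggers and its arguments are hypothetical names for this example. It assumes each sample in the time window has already been reduced to a flag saying whether it crossed the boundary in question.

```python
def window_triggers(flags, occurrence):
    """Decide whether a time window puts the monitor into a given state.

    flags      -- one bool per sample in the window, True when the
                  sample went beyond the boundary
    occurrence -- the string "once", or a fraction such as 0.5 or 1.0
    """
    if not flags:
        return False  # no samples, nothing to trigger on
    if occurrence == "once":
        # A single sample beyond the boundary is enough.
        return any(flags)
    # Otherwise at least the given fraction of samples must trigger.
    return sum(flags) / len(flags) >= occurrence
```

For a window where half of the samples crossed the boundary, "once" and 0.5 both trigger, while 1.0 does not.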

Implementation Details

Lightbend Console monitors have a 3-tiered structure.

  1. A model expression based on a recorded metric.
  2. A health expression based on the model output.
  3. An alert expression based on the health output.

Model

Models are usually specific to the individual monitor.

Health

In the case of SMA monitors, the health expression is an aggregate of the model output over a time window, which is then compared to a threshold value representing the trigger occurrence confidence level. The result of the comparison is multiplied by 2 for Warning and by 4 for Critical levels. Most health expressions include a conjunction with the raw model output to filter out missing scrapes. For example:

(avg_over_time(model{label_selector=""}[15m]) >= bool 0.5) * 2 
and model{label_selector=""}

In this example, the aggregate function is avg_over_time, the time window is [15m] and the confidence level is 0.5. The length of the time window and the trigger occurrence confidence level can be set in the UI.

Alert

The alerts are based on simple threshold expressions, which are the same for all default monitors. For example:

health{label_selector=""} > 0
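To make the semantics of the health and alert expressions above easier to follow, here is a rough Python emulation. It is only a sketch: the function names are made up for illustration, and it assumes the model emits 1 for a scrape that breaches the boundary and 0 otherwise, which the expressions suggest but do not state.

```python
def health(window_samples, current_sample, confidence=0.5, weight=2):
    """Emulate (avg_over_time(model[window]) >= bool confidence) * weight
    and model.

    window_samples -- model outputs (assumed 0 or 1) inside the window
    current_sample -- the latest model output, or None for a missing scrape
    weight         -- 2 for Warning, 4 for Critical
    """
    # The "and model{...}" conjunction drops the result when the raw
    # model has no current sample, i.e. the scrape is missing.
    if current_sample is None or not window_samples:
        return None
    average = sum(window_samples) / len(window_samples)  # avg_over_time
    # ">= bool" yields 1 or 0 instead of filtering the series.
    return weight * (1 if average >= confidence else 0)


def alert(health_value):
    """Emulate health > 0: fire only for an existing, non-zero health value."""
    return health_value is not None and health_value > 0
```

With these assumptions, a window of [1, 1, 0, 0] gives a Warning health value of 2 and fires the alert, while a missing scrape yields no health value and therefore no alert.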