Threshold Monitor

A Threshold Monitor is a straightforward monitor that compares a metric against a fixed threshold. When the metric violates the threshold condition for a sufficient percentage of the scrapes in the time window, the corresponding alert is triggered. Typically, a threshold monitor is considered unhealthy when the metric exceeds a certain value. Other triggering conditions are also available, such as when the metric drops below a given value.

In this case the metric crossed the warning threshold several times, but the health didn’t change to warning every time. The trigger occurrence value was set to 25% and the time window to 5 minutes. For the health to go to the warning level, at least 25% of the instantaneous health values (one is generated with each scrape) in the preceding 5-minute window had to be at the warning level. This is an example of how to ignore brief violations. If the trigger occurrence value had been set to “once”, the health would go to warning every time the metric violated the threshold (i.e. very sensitive). With a setting of “100%”, all instantaneous health values in the window have to be in violation for the health to go to warning (i.e. insensitive). This is appropriate if you only care about sustained violations.
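The effect of the trigger occurrence setting can be illustrated with a small simulation. This is only a sketch of the rule described above, not the Console's implementation: the function name window_is_unhealthy, the 15-second scrape interval (20 scrapes per 5-minute window), and the treatment of an empty window are all assumptions made for illustration.

```python
from typing import Sequence, Union


def window_is_unhealthy(violations: Sequence[bool],
                        trigger: Union[float, str]) -> bool:
    """Decide whether a time window counts as unhealthy.

    violations: one flag per scrape in the window, True if that scrape
                violated the threshold (hypothetical representation).
    trigger:    a fraction such as 0.25 or 1.0, or the string "once".
    """
    if not violations:
        return False  # assumption: an empty window is never unhealthy
    if trigger == "once":
        return any(violations)
    fraction = sum(violations) / len(violations)
    return fraction >= trigger


# Assumed 15 s scrape interval: 20 scrapes in a 5-minute window.
brief = [True] * 3 + [False] * 17      # 15% of scrapes in violation
sustained = [True] * 6 + [False] * 14  # 30% of scrapes in violation

print(window_is_unhealthy(brief, 0.25))      # False: below 25%
print(window_is_unhealthy(sustained, 0.25))  # True: at least 25%
print(window_is_unhealthy(brief, "once"))    # True: any violation counts
print(window_is_unhealthy(sustained, 1.0))   # False: not every scrape
```

With the 25% setting, the brief burst is ignored while the longer one changes the health, which matches the behaviour described above.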

This monitor type is appropriate if you have a known hard limit for some metric. It works with an underlying Prometheus Gauge metric.

Implementation Details

Lightbend Console monitors have a three-tiered structure:

  1. A model expression based on a recorded metric.
  2. A health expression based on the model output.
  3. An alert expression based on the health output.

Model

In the case of threshold monitors, the model expression is in the form of a metric with some label selectors compared to a threshold value. For example:

kube_pod_ready{label_selector=""} < bool 1

The comparison operator and the threshold value can be set with the THRESHOLD drop-down and value field in the UI.
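In PromQL, the bool modifier makes a comparison return 1 or 0 for every series instead of filtering out the series that fail it. The sketch below mimics that behaviour for a single scrape. The function model_output, the pod names, and the operator table are hypothetical. The table lists the PromQL comparison operators, and the UI drop-down may not offer all of them.

```python
import operator
from typing import Dict

# PromQL comparison operators mapped to Python equivalents.
OPERATORS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}


def model_output(samples: Dict[str, float], op: str,
                 threshold: float) -> Dict[str, int]:
    """Emulate `metric <op> bool threshold` for one scrape.

    Every input series is kept; its value becomes 1 when the
    comparison holds (a violation) and 0 otherwise.
    """
    compare = OPERATORS[op]
    return {series: int(compare(value, threshold))
            for series, value in samples.items()}


# Hypothetical kube_pod_ready samples: 1 means ready, 0 means not ready.
pods = {"pod-a": 1.0, "pod-b": 0.0}
print(model_output(pods, "<", 1))  # {'pod-a': 0, 'pod-b': 1}
```

Because no series is dropped, the output can be averaged over a time window in the health expression.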

Health

In the case of threshold monitors, the health expression is an aggregate of the model output over a time window, which is then compared to a threshold value representing the trigger occurrence confidence level. The result of the comparison is multiplied by 2 for Warning and by 4 for Critical levels. Most health expressions include a conjunction with the raw model output to filter out missing scrapes. For example:

(avg_over_time(model{label_selector=""}[5m]) >= bool 1) * 4
and model{label_selector=""}

In this example, the aggregate function is avg_over_time, the time window is 5 minutes ([5m]), the confidence level is 1 (that is, 100%), and the multiplier 4 marks the Critical level. The length of the time window and the trigger occurrence confidence level can be set in the UI.
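The evaluation of such a health expression for one series can be sketched as follows. This is an illustrative model only: the function name health and its parameters are hypothetical, and returning None stands in for the series being absent from the query result, which is how the PromQL "and" operator behaves when the right-hand side has no matching sample.

```python
from typing import Optional, Sequence

WARNING, CRITICAL = 2, 4  # multipliers named in the text above


def health(window: Sequence[int], latest: Optional[int],
           confidence: float, severity: int) -> Optional[int]:
    """Emulate (avg_over_time(model[w]) >= bool confidence) * severity
    and model, for a single series.

    window: model outputs (0 or 1) for the scrapes inside the window.
    latest: the current model sample, or None if the scrape is missing.
    """
    if latest is None or not window:
        return None  # the `and model{...}` clause drops the series
    average = sum(window) / len(window)
    return int(average >= confidence) * severity


print(health([1, 1, 1, 1], 1, 1.0, CRITICAL))  # 4: every scrape violated
print(health([1, 0, 1, 1], 1, 1.0, CRITICAL))  # 0: one healthy scrape
print(health([1, 0, 0, 0], 0, 0.25, WARNING))  # 2: 25% of scrapes violated
print(health([1, 1], None, 1.0, CRITICAL))     # None: current scrape missing
```

The last call shows why the conjunction matters: the window can still contain older samples while the current scrape is missing, and in that case no health value is produced.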

Alert

The alerts are based on simple threshold expressions, which are the same for all default monitors. For example:

health{label_selector=""} > 0
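Without the bool modifier, a PromQL comparison acts as a filter, so only series whose health is above 0 remain and can raise an alert. The sketch below shows that filtering, under the assumption that health values of 2 and 4 correspond to Warning and Critical as described in the Health section. The function firing_alerts and the series names are hypothetical, and the level names in the result are added for illustration only.

```python
from typing import Dict

# Health values produced by the multipliers in the Health section.
LEVELS = {2: "warning", 4: "critical"}


def firing_alerts(health_values: Dict[str, int]) -> Dict[str, str]:
    """Emulate `health > 0`: keep only the series with non-zero health."""
    return {series: LEVELS.get(value, "unknown")
            for series, value in health_values.items()
            if value > 0}


current = {"pod-a": 0, "pod-b": 2, "pod-c": 4}
print(firing_alerts(current))  # {'pod-b': 'warning', 'pod-c': 'critical'}
```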