Growth Rate Monitor

The Growth Rate Monitor health function is based on the Prometheus deriv() function, which is defined as follows:

deriv(v range-vector) calculates the per-second derivative of the time series in a range vector v, using simple linear regression.
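
To make the definition concrete, the following is a minimal Python sketch of a least-squares slope over a window of samples. It is an illustration only, not the Prometheus implementation; the function name regression_slope and the sample data are hypothetical.

```python
def regression_slope(samples):
    """Least-squares slope, in value units per second, of (timestamp, value) pairs."""
    n = len(samples)
    if n < 2:
        return None  # a slope needs at least two points
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope of the fitted line = covariance(t, v) / variance(t)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return None  # all samples share one timestamp
    return cov / var


# Hypothetical gauge scraped every 15 seconds, growing by 30 per scrape,
# which corresponds to a growth rate of 2 units per second.
window = [(0, 100), (15, 130), (30, 160), (45, 190)]
slope = regression_slope(window)
```

Because the slope is fitted to every sample in the window rather than to its two endpoints, a single noisy sample has limited influence on the result.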

Note that the deriv() function, and thus this monitor type, is intended to be used only with Gauge metrics. Note also that the word rate in this discussion of Growth Rate Monitors is meant in the general sense and is unrelated to the rate() Prometheus function.

The health of a Growth Rate Monitor is based on the derivative of the underlying metric. At each point in time, a per-second derivative value is determined from the slope of the linear regression fitted to all the values in the preceding time window (the Rate Window in the Console). If the derivative exceeds a given threshold (the Rate Threshold in the Console), the metric is considered unhealthy. If this condition persists for some time (Sustain at least in the Console), an alert is triggered. Warning and critical alerts have individual sustain periods.
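
The threshold and sustain behavior described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: derivative values are evaluated at a fixed interval, the sustain period is expressed as a number of consecutive evaluations, and the function name alert_fires and the sample values are hypothetical. It is not the Console's actual implementation, which is expressed in PromQL.

```python
def alert_fires(derivatives, rate_threshold, sustain_samples):
    """Return True if the derivative stays above rate_threshold for
    at least sustain_samples consecutive evaluations."""
    run = 0  # length of the current unbroken run above the threshold
    for d in derivatives:
        run = run + 1 if d > rate_threshold else 0
        if run >= sustain_samples:
            return True
    return False


# Hypothetical derivative values, one per evaluation interval.
derivs = [0.2, 0.8, 0.9, 0.3, 0.7, 0.8, 0.9, 0.6, 0.1]

# The longest run above 0.5 is four evaluations long.
fires_with_sustain_4 = alert_fires(derivs, rate_threshold=0.5, sustain_samples=4)
fires_with_sustain_5 = alert_fires(derivs, rate_threshold=0.5, sustain_samples=5)
```

With these values, a sustain period of four evaluations triggers while a sustain period of five does not, which mirrors how a longer sustain setting suppresses alerts on short spikes.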

Consider the monitor in the image above. The fitted derivative is displayed with the dashed white line. This behaves in a fashion similar to the smoothed average of the SMA monitor type. The longer the rate window, the smoother the growth rate graph will be.

The scale for the rate line is provided on the right-side y-axis of the graph. The threshold value is shown with the magenta line. In this case, the rate went over the threshold on several occasions, but it stayed over long enough (one minute, as determined by the sustain at least setting) only once, when the health went briefly into the warning state around the 16:15 mark.

This monitor type is a good choice if you want to be alerted on significant trend changes in the underlying metric. Consider a metric for the number of messages in an input queue in front of some service. Fluctuations are expected over the day. With the onset of heavy load, the number will go up at a predictable rate of change, eventually hitting a plateau and going down as the service catches up and the input queue is drained. If the metric starts going up too fast, though, this might indicate problems with the service or perhaps the onset of a DoS attack. A Growth Rate Monitor would be appropriate in this case.

Implementation Details

Lightbend Console monitors have a 3-tiered structure.

  1. A model expression based on a recorded metric.
  2. A health expression based on the model output.
  3. An alert expression based on the health output.

Model

TODO

Health

In the case of growth rate monitors, the health expression is an aggregate of the model output over a time window, which is then compared to a threshold value representing the trigger occurrence confidence level. The result of the comparison is multiplied by 2 for Warning and by 4 for Critical levels. Most health expressions include a conjunction with the raw model output to filter out missing scrapes. For example:

(avg_over_time(model{label_selector=""}[5m]) >= bool 1) * 4 
and model{label_selector=""}

In this example, the aggregate function is avg_over_time, the time window is [5m] and the confidence level is 1. The length of the time window and the trigger occurrence confidence level can be set in the UI.
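
The way this expression evaluates can be approximated with a short Python sketch. It rests on assumptions that the text above does not confirm: the model output is taken to be 1 when the growth rate exceeds the rate threshold and 0 otherwise, a missed scrape is represented as None, and the function name health is hypothetical. PromQL staleness handling is reduced here to checking the most recent sample.

```python
def health(model_window, confidence, severity_weight):
    """Approximate (avg_over_time(model[w]) >= bool confidence) * severity_weight and model.

    model_window lists the model outputs in the time window, oldest first;
    None marks a missed scrape.
    """
    # 'and model{...}': no result when the current model sample is absent
    if not model_window or model_window[-1] is None:
        return None
    present = [m for m in model_window if m is not None]
    avg = sum(present) / len(present)  # avg_over_time over the window
    # '>= bool' yields 1 or 0, which is then scaled by the severity weight
    return (1 if avg >= confidence else 0) * severity_weight


critical = health([1, 1, 1, 1], confidence=1, severity_weight=4)
warning = health([1, 1, 1, 1], confidence=1, severity_weight=2)
healthy = health([1, 0, 1, 1], confidence=1, severity_weight=4)
missing = health([1, 1, 1, None], confidence=1, severity_weight=4)
```

With a confidence level of 1, the average reaches the threshold only when every sample in the window is 1, so a single healthy sample keeps the monitor healthy. Lower confidence levels would tolerate some healthy samples within the window.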

Alert

The alerts are based on simple threshold expressions, which are the same for all default monitors. For example:

health{label_selector=""} > 0