Console and Monitor Overview

Lightbend Console provides the following information:

  • Cluster page—displays an overview of the workloads and pods in the cluster in the Cluster Pod Map and the Workloads table.
  • Workloads page—lists the monitors for a given workload and shows their health.
  • Monitors page—details a monitor’s attributes, metrics, and current health, and allows you to create new monitors or edit existing ones.
  • Grafana Dashboards—graphs of all the metrics backing the monitors of a workload, as well as KPIs for the workload (based on its service type).
  • Lightbend Telemetry Dashboards—pre-configured Grafana dashboards that display Lightbend Telemetry metrics.

Navigating the Console

Navigate the Console by drilling down from the cluster to workloads to monitors, or by opening the Grafana dashboard:

  • From the Cluster page, click a workload, either in the table or the map, to open the Workload page, which shows all monitors and some workload details.
  • From the Workload page, click a monitor to open the Monitor page.
  • From the left panel of any page, click the Grafana icon to open the Grafana dashboard.
  • To return to the Cluster page, click the Lightbend Console logo in the top left or in the breadcrumbs.

Monitors

In general terms, the Lightbend Console monitors Prometheus metrics so that you can:

  • Track the health of the workload (i.e. the application) being monitored.
  • Get alerts when the metric becomes unhealthy.

A monitor in Lightbend Console can be thought of as a specification for generating health and alert rules for a particular Prometheus metric.

A monitor definition includes:

  • a name
  • a workload name
  • a Prometheus metric
  • a monitor type (e.g. threshold, simple moving average, growth rate)
  • a function and parameters appropriate for the monitor type

The predefined monitor types are threshold, simple moving average (sma), and growth. These are described in detail below.
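
To make the shape of a definition concrete, a hypothetical sketch covering just the fields above might look like the following. This is illustrative only; it is not the Console’s actual configuration format, and the field names and values are assumptions.

    # Illustrative sketch only; not the Console's real monitor format.
    # Field names and values are hypothetical.
    name: akka_processing_time
    workload: shopping-cart               # hypothetical workload name
    metric: akka_actor_processing_time_ns
    monitorType: sma                      # threshold | sma | growth
    parameters:
      window: 15m                         # averaging window
      warningMultiplier: 1                # standard deviations for warning
      criticalMultiplier: 2               # standard deviations for critical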

A monitor is used to generate a set of health and alert recording rules in Prometheus, and there are often multiple sets per monitor; for example, there might be a separate set for each pod generating that Prometheus metric. The Console comes with several predefined monitors (e.g. Kubernetes-related and Akka-related monitors), all with reasonable default configurations. As required, the Console allows a user to create or delete monitors and modify the associated parameters, resulting in health and alert rules tuned for their particular application.
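
As a rough mental model (the Console generates and manages its own rules, so the rule name, metric, and threshold below are invented for illustration), a per-pod health rule behaves like a Prometheus recording rule whose expression preserves the pod label:

    # Illustrative only; not the Console's actual generated rules.
    groups:
      - name: example-monitor-recording
        rules:
          # 1 while the (hypothetical) metric violates the threshold,
          # 0 otherwise. Because the pod label is preserved, this yields
          # one health series per pod producing the metric.
          - record: example:health_instant:warning
            expr: my_gauge_metric > bool 100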

Health and Alerts

With every Prometheus scrape, we get new metric data, and that data is tagged. The tags qualify where the data applies; they may indicate, for example, which pod or application the data comes from. A monitor is tied to a workload (typically the same as the application, but not necessarily). The monitor is used to generate health (and alert) time series for the associated workload. A single monitor definition may thus be used to generate multiple health (and alert) time series, depending on how the data labels are partitioned.

For example, for a given monitor there may be separate health/alert rules for each pod in the workload. All those time series would share a workload and Prometheus metric (as associated with the monitor) but differ on the pod tag.

The monitor rule compares some function of the metric values against thresholds to determine instantaneous health values. We calculate both a warning and critical instantaneous health value for each health series.

The health status for a health series is based on how the instantaneous values trend over time. For a given time window and severity (e.g. 15 minutes and warning), some percentage of the instantaneous health values may be of the given severity. If that percentage is over another threshold (e.g. 50%), then the health will be said to have a state of severity (e.g. warning) in that period. This calculation also happens with every scrape.
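
In plain PromQL terms (a sketch only, using an invented recorded series rather than the Console’s actual rules), the calculation amounts to taking the fraction of instantaneous health samples at a given severity over the window and comparing it to the trigger percentage:

    # Hypothetical series name. If the instantaneous warning series is 0/1,
    # its average over the window is the fraction of samples at warning
    # severity; compare that fraction to the 50% trigger.
    avg_over_time(example:health_instant:warning[15m]) > 0.5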

Finally, one can specify if an alert should be triggered based on the health. Separate alerts can be created for the warning and critical health states. What it means to “alert” is managed by your alertmanager configuration.
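
Expressed as plain Prometheus alerting rules (again an illustrative sketch, not the Console’s generated output), separate warning and critical alerts might look like the following, with the severity label available for routing in your Alertmanager configuration:

    # Hypothetical rule and series names; thresholds are illustrative.
    groups:
      - name: example-monitor-alerts
        rules:
          - alert: ExampleMonitorWarning
            expr: avg_over_time(example:health_instant:warning[15m]) > 0.5
            labels:
              severity: warning
          - alert: ExampleMonitorCritical
            expr: avg_over_time(example:health_instant:critical[15m]) > 0.5
            labels:
              severity: critical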

Example

Assume you have an application with two actors, a producer and a consumer, both implemented in Akka. Also assume you have two producer pods and three consumer pods. One aspect of this scenario that you might want to monitor is the processing time for requests to the consumer actors.

There are default monitors defined for Akka apps, and one of them is akka_processing_time. It’s based on the akka_actor_processing_time_ns Prometheus metric and is of the simple moving average type. Corresponding health and alert time series will therefore be created automatically for the actors.

By default, there will be health and alert time series for each combination of actor and pod. The Console would allow you to group by pod only if desired.

The monitor parameters could be set to consider the health to be in a warning state if processing time goes beyond 1 standard deviation from the average (which would suggest things are starting to get much slower). The UI would allow you to see if and when this is happening for each pod.

An alert could be configured to trigger if the health goes into the warning state for 75% of the samples in a 5 minute period.
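
In raw PromQL terms, the warning condition and the alert trigger for this example could be approximated as follows. This is a sketch rather than the Console’s generated rules; the 15 minute averaging window and the recorded warning series name are assumptions.

    # Warning condition: processing time deviates from its moving average
    # by more than 1 standard deviation (the 15m window is an assumption).
    abs(
      akka_actor_processing_time_ns
        - avg_over_time(akka_actor_processing_time_ns[15m])
    ) > 1 * stddev_over_time(akka_actor_processing_time_ns[15m])

    # Alert trigger: at least 75% of the samples in the last 5 minutes
    # were in the warning state (hypothetical recorded 0/1 series).
    avg_over_time(akka_processing_time:health_instant:warning[5m]) >= 0.75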

Monitor types

The Console comes with a predefined set of basic monitor types: threshold, simple moving average, and growth rate.

The following sections describe the workings of each monitor type. See Editing Monitors to understand how to tune monitor parameters to obtain the desired level of monitoring and alerting.

Threshold Monitor

A Threshold Monitor is a straightforward monitor. When the metric violates the threshold condition for a sufficient percentage of the scrapes in the time window, a corresponding alert will be triggered. Typically a threshold monitor will be considered unhealthy when the metric exceeds a certain value. Other triggering conditions are also available, such as when the metric drops below a given value.

In this case the metric crossed the warning threshold several times, but the health didn’t change to warning every time. The trigger occurrence value was set to 25% and the time window to 5 minutes. In order for the health to go to the warning level, at least 25% of the instantaneous health values (generated with each scrape) in the preceding 5 minute window had to be at the warning level. This is an example of how to ignore brief violations. If the trigger occurrence value had been set to “once”, then the health would go to warning every time the metric violated the threshold (i.e. very sensitive). With a setting of “100%”, all instantaneous health values in the period have to be in violation for the health to go to warning (i.e. insensitive). This is appropriate if you only care about sustained violations.
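
As a rough PromQL equivalent (illustrative only; the metric name, limit, and subquery resolution are invented), the 25%-of-5-minutes behavior corresponds to averaging the boolean violation over the window:

    # The inner comparison yields 1 while the hypothetical gauge violates
    # the limit of 100; averaging it over 5 minutes gives the fraction of
    # samples in violation, which is then compared to the 25% trigger.
    avg_over_time((my_gauge_metric > bool 100)[5m:30s]) >= 0.25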

This monitor type is appropriate if you have a known hard limit for some metric. It works with an underlying Prometheus Gauge metric.

Simple Moving Average Monitor

The Simple Moving Average monitor type assumes that the underlying metric will vary up and down and attempts to tolerate more variation in volatile times while tolerating less in calm periods. It does this by comparing the metric to a time-averaged version of itself: a smoother version. The amount of smoothing is a function of how wide the time window is for averaging. If the deviation from the average is less than a given minimum tolerance, or failing that, if the deviation is within some multiple of the standard deviation of the average, then the value is considered healthy. Greater deviations put the health into either the warning or critical state, depending on which of the thresholds has been crossed.
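
As a sketch of the band construction (illustrative PromQL; the metric name, 15 minute window, minimum tolerance of 50, and warning multiplier of 1 are all assumptions), the upper warning threshold tracks the moving average plus the larger of the minimum tolerance and the scaled standard deviation; the lower threshold subtracts the same amount:

    # clamp_min keeps the band at least as wide as the minimum tolerance,
    # so the threshold follows whichever is further from the average.
    avg_over_time(my_metric[15m])
      + clamp_min(1 * stddev_over_time(my_metric[15m]), 50)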

You have control over the window size, minimum tolerance, multipliers, and alert triggering percentages for each such monitor.

This is a good choice if you expect a lot of variation in the metric and there is no hard limit. For more information on the concepts, see, for example, the Simple Moving Average documentation on Wikipedia.

Consider the image above that shows an SMA monitor for akka_processing_time opened in the editor window. The raw metric data is plotted in blue. The moving average is calculated over a 15 minute time window, making it smoother than the raw data. The size of the averaging window is shown in the legend. The shorter the window, the less the smoothing. The thick translucent white band tracks the moving average. Its thickness is controlled by the tolerance setting. Deviations within the tolerance band are considered healthy.

The pairs of orange and red lines are the warning and critical thresholds. In this case they’re one and two standard deviations from the average respectively. If the minimum tolerance happens to be further out than the standard deviation, the line will follow the minimum tolerance instead. Consider the warning (orange) line. If the raw data falls outside the tolerance band and also beyond the warning line, then this is considered a warning event. Similarly for the critical (red) line. Some of these events are indicated in the diagram.

Whether or not these deviations can trigger an alert is a function of the Trigger Occurrence setting. In the example given, this value is set to 50%. This means that at least 50% of the samples in the time window must be triggering.

The health bar shows the health state over time for the monitor. In the example, there is a brief orange bar, indicating a health state of warning. There are several things to note here.

  • The health for the monitor was never considered critical, even though there were events where the metric went outside the critical boundaries. This is because the trigger occurrence was set to 50%. In this case, there were not enough critical events (fewer than 50% in the time window) to warrant considering the health to be critical.
  • The health went to the warning state only after 50% of the health events in the time window were warning.

If the trigger occurrence setting had been set to “once”, the monitor would have been considered warning (or critical) as soon as a single value went beyond the boundary. If it had been set to 100%, then in this case the monitor would have been considered healthy throughout because there was no 15 minute period in which the metric was continuously beyond the boundary.

Growth Rate Monitor

The Growth Rate Monitor health function is based on the deriv() Prometheus function, which is defined as follows:

deriv(v range-vector) calculates the per-second derivative of the time series in a range vector v, using simple linear regression.

Note that the deriv() function, and thus this monitor type, is intended to be used only with Gauge metrics. Note also that the word rate in this discussion of Growth Rate Monitors is meant in the general sense and is unrelated to the rate() Prometheus function.

The health of a Growth Rate Monitor is based on the derivative of the underlying metric. At each point in time, a per-second derivative value is determined from the slope of the linear regression generated using all the values of the preceding time window (the Rate Window in the Console). If the derivative exceeds a given threshold (the Rate Threshold in the Console), the metric is considered unhealthy. If this condition persists for some time (Sustain at least in the Console), then an alert is triggered. Warning and critical alerts have individual sustain periods.
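
In raw Prometheus terms (a sketch; the gauge name, rate window, threshold, and sustain period are assumptions), the warning alert of a Growth Rate Monitor behaves like an alerting rule whose expression uses deriv() and whose for clause plays the role of the sustain period:

    # Hypothetical values: 10m rate window, threshold of 5 units per
    # second, warning sustained for at least 1 minute before alerting.
    groups:
      - name: example-growth-rate
        rules:
          - alert: ExampleGrowthWarning
            expr: deriv(my_gauge_metric[10m]) > 5
            for: 1m
            labels:
              severity: warning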

Consider the monitor in the image above. The fitted derivative is displayed with the dashed white line. This behaves in a fashion similar to the smoothed average of the SMA monitor type. The longer the rate window, the smoother the growth rate graph will be.

The scale for the rate line is provided on the right-side y-axis of the graph. The threshold value is shown with the magenta line. In this case, even though the rate was over the threshold on several occasions, it was only over long enough (one minute, as determined by the sustain at least setting) for the health to go into the warning state briefly around the 16:15 mark.

This monitor type is a good choice if you want to be alerted on significant trend changes in the underlying metric. Consider a metric for the number of messages in an input queue in front of some service. Fluctuations will be expected over the day. With the onset of heavy load, the number will go up with a predictable rate of change, eventually hitting a plateau and going down as the service catches up and the input queue is drained. If the metric starts going up too fast though, this might indicate problems with the service or perhaps the onset of a DoS attack. A Growth Rate Monitor would be appropriate in this case.
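
For the queue scenario (hypothetical metric name and threshold), the expression behind such a monitor reduces to something like:

    # Flag the queue-depth gauge growing faster than 50 messages per
    # second; the monitor's sustain period decides when this alerts.
    deriv(input_queue_messages[10m]) > 50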