Configuring default monitors

Lightbend Console provides a set of default monitors that are configured to detect common issues with Lightbend open source and commercial products. You can tune these monitors by supplying a custom default monitors configuration file.

Getting configuration files

You can customize default monitors by modifying the configuration shipped with Lightbend Console. It consists of two files: default-monitors.json and static-rules.yml. The first is the actual default monitor configuration; the second defines Prometheus recording rules for metrics that some of the monitors use. In most cases, modifying default-monitors.json is sufficient. The only reason to modify static-rules.yml is to add recording rules that use raw PromQL to produce custom metrics for a default monitor. This document focuses on default-monitors.json and covers static-rules.yml only briefly.

Download the default monitor configuration files from a running Lightbend Console installation:

mkdir default-monitors-config
cd default-monitors-config
kubectl get configmap -n lightbend console-api-static-rules -o jsonpath='{.data.static-rules\.yml}' > static-rules.yml
kubectl get configmap -n lightbend console-api-default-monitors -o jsonpath='{.data.default-monitors\.json}' > default-monitors.json
cd ..
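
Before editing, it can help to confirm that the downloaded default-monitors.json parses and to see which monitors it contains. The following is a minimal sketch, not part of Lightbend Console: find_monitors and summarize are hypothetical helpers, and the only assumption about the file is that each monitor is a JSON object carrying both a "model" and a "parameters" key, as in the examples later on this page. The walk is recursive, so it does not depend on how monitors are grouped at the top level.

```python
import json


def find_monitors(node, name=""):
    """Yield (name, model) for every object that looks like a monitor.

    Assumption: a monitor is any JSON object with both "model" and
    "parameters" keys; its name is the key it is stored under.
    """
    if isinstance(node, dict):
        if "model" in node and "parameters" in node:
            yield name, node["model"]
        else:
            for key, child in node.items():
                yield from find_monitors(child, key)
    elif isinstance(node, list):
        for child in node:
            yield from find_monitors(child, name)


def summarize(path):
    """Parse the JSON file at `path` and return a {monitor name: model} dict."""
    with open(path, encoding="utf-8") as handle:
        return dict(find_monitors(json.load(handle)))
```

For example, summarize("default-monitors-config/default-monitors.json") returns a dictionary of monitor names and their models, and raises json.JSONDecodeError if the file is not valid JSON.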

Modifying static-rules.yml file

A metric available in Prometheus is not always a good input to a monitor. For example, you might want to monitor the rate of increase when the metric is a counter, or you might want to filter out samples with specific labels. By modifying the static-rules.yml file, you can define new metrics based on existing ones. All Prometheus recording rules defined in static-rules.yml are accessible to monitors. A custom recording rule looks like this:

- record: prometheus_rule_evaluation_failures_rate
  expr: irate(prometheus_rule_evaluation_failures_total[5m])

For more information on writing custom Prometheus recording rules, see the Prometheus documentation.
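
To make the effect of the example rule concrete, the sketch below mimics what the PromQL irate() function computes: the per-second rate between the last two samples in the range, with a drop in value treated as a counter reset. The irate function here is an illustrative Python stand-in, not a Prometheus API, and the sample values are made up.

```python
def irate(samples):
    """Per-second instant rate from the last two (timestamp_s, value) samples.

    Follows the documented behavior of PromQL's irate(): only the two most
    recent samples in the range are used, and a drop in value is treated
    as a counter reset (the counter restarted from zero).
    Returns None when fewer than two usable samples are available.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    elapsed = t_last - t_prev
    if elapsed <= 0:
        return None
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / elapsed


# Hypothetical counter samples scraped every 15 seconds.
failures_total = [(0, 10.0), (15, 10.0), (30, 13.0)]
print(irate(failures_total))  # 3 new failures over the last 15 s
```

Because only the last two samples matter, the recorded metric reacts quickly to bursts; the [5m] range only bounds how far back Prometheus looks for those two samples.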

Modifying default-monitors.json file

The default-monitors.json file consists of a list of monitors, each of one of three types: threshold, growth, and simple moving average (sma). More detailed descriptions of monitors and their types can be found in Console and Monitor Overview. The rest of this page gives example JSON syntax for each monitor type, with field descriptions.

Common monitor fields

All the monitor types share these fields:

  • monitorVersion: must be “1”
  • model: one of “threshold”, “growth” or “sma”
  • parameters.metric: underlying Prometheus metric, or a recording rule defined in static-rules.yml
  • parameters.summary: short summary of the monitor, used when alerting
  • parameters.description: templated description of the condition in which the monitor is unhealthy, used when alerting
  • parameters.confidence: ratio of unhealthy to total samples inside the window that is needed to declare the monitor unhealthy; must be one of “5e-324” (meaning at least one sample), “0.25”, “0.5”, “0.75”, “0.95”, “1”
  • parameters.filters: the monitor only uses samples that match this list of Prometheus metric labels and their values
  • parameters.severity: one or both of “warning”, “critical”; each contains the monitor-type-specific parameters described below
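
The constraints in this list can be checked mechanically before the file is deployed. The following is a minimal sketch of such a check; check_common_fields is a hypothetical helper, not a Lightbend Console tool, and treating metric, confidence and severity as required fields is an assumption based on the examples on this page.

```python
ALLOWED_MODELS = {"threshold", "growth", "sma"}
ALLOWED_CONFIDENCE = {"5e-324", "0.25", "0.5", "0.75", "0.95", "1"}
ALLOWED_SEVERITIES = {"warning", "critical"}


def check_common_fields(monitor):
    """Return a list of problems found in one monitor definition (a dict)."""
    problems = []
    params = monitor.get("parameters", {})
    if monitor.get("monitorVersion") != "1":
        problems.append('monitorVersion must be "1"')
    if monitor.get("model") not in ALLOWED_MODELS:
        problems.append("model must be one of threshold, growth, sma")
    # Assumption: metric, confidence and severity are required.
    if not params.get("metric"):
        problems.append("parameters.metric is missing")
    if params.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append("parameters.confidence has an unsupported value")
    severity = params.get("severity", {})
    if not severity or not set(severity) <= ALLOWED_SEVERITIES:
        problems.append("parameters.severity needs warning and/or critical")
    return problems
```

An empty list means the common fields look consistent with the rules above; the type-specific parameters still need to be checked separately.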

Threshold monitor

"server_5xx": {
    "monitorVersion": "1",
    "model": "threshold",
    "parameters": {
        "metric": "http_server_responses_5xx_rate",
        "window": "5m",
        "confidence": "1",
        "severity": {
            "warning": {
                "comparator": ">",
                "threshold": "0"
            }
        },
        "summary": "HTTP 5xx errors",
        "description": "HTTP server at {{$labels.instance}} has 5xx errors"
    }
}

Threshold monitors add these fields:

  • parameters.window: time window for calculating health, used together with the common parameters.confidence
  • parameters.severity: one or both of “warning”, “critical”
  • parameters.severity.warning.comparator: operator for comparing the metric value to the threshold, one of “<”, “>”, “<=”, “>=”, “==”, “!=”
  • parameters.severity.warning.threshold: value to compare the metric against

Note that comparator and threshold follow the same syntax inside the critical severity block.
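
The interaction between comparator, threshold, window and confidence can be sketched as follows. This is an illustration of the semantics described above, not Lightbend Console's implementation: threshold_unhealthy is a hypothetical function, samples stands for the metric values that fall inside the window, and using "greater than or equal" for the confidence check is an assumption.

```python
import operator

COMPARATORS = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def threshold_unhealthy(samples, comparator, threshold, confidence):
    """Decide health from the metric values observed inside the window.

    A sample is unhealthy when `value <comparator> threshold` holds. The
    monitor is unhealthy when the unhealthy/total ratio reaches the
    confidence value (assumed to be an inclusive bound).
    """
    if not samples:
        return False
    compare = COMPARATORS[comparator]
    limit = float(threshold)
    unhealthy = sum(1 for value in samples if compare(value, limit))
    return unhealthy / len(samples) >= float(confidence)


window_samples = [0.0, 0.0, 2.0]  # hypothetical 5xx rates within 5m
print(threshold_unhealthy(window_samples, ">", "0", "1"))       # not every sample is bad
print(threshold_unhealthy(window_samples, ">", "0", "5e-324"))  # one bad sample is enough
```

The value 5e-324 is the smallest positive double, so any non-zero ratio meets it, which is why it means "at least one sample".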

Growth monitor

"task_queue_growth": {
    "monitorVersion": "1",
    "model": "growth",
    "parameters": {
        "metric": "task_queue_length",
        "filters": {
            "quantile": "0.5"
        },
        "period": "15m",
        "minslope": "0.1",
        "confidence": "1",
        "severity": {
            "critical": {
                "window": "5m"
            }
        },
        "summary": "task queue growing",
        "description": "node {{$labels.instance}} has a growing task queue"
    }
}

Growth monitors add these fields:

  • parameters.period: period over which the linear regression of the underlying metric is calculated
  • parameters.minslope: if the slope of the linear regression line exceeds this value, the monitor is considered unhealthy
  • parameters.severity.critical.window: time window for calculating health, used together with the common parameters.confidence; it plays the same role as parameters.window in threshold and sma monitors, but growth monitors have separate windows for the warning and critical severities

The underlying Prometheus metric task_queue_length is assumed to expose queue sizes aggregated by quantile, so a filter on the quantile label selects the median (0.5) length.
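
The growth check can be pictured as an ordinary least-squares fit over the samples in the period, followed by a comparison of the fitted slope with minslope. The sketch below is illustrative only: regression_slope and growth_unhealthy are hypothetical names, and expressing the slope in metric units per second is an assumption, since this page does not state the unit used by minslope.

```python
def regression_slope(points):
    """Least-squares slope of (time_s, value) points, in value units per second."""
    count = len(points)
    if count < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / count
    mean_v = sum(v for _, v in points) / count
    spread = sum((t - mean_t) ** 2 for t, _ in points)
    if spread == 0:
        return 0.0  # all samples share one timestamp, no slope defined
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return covariance / spread


def growth_unhealthy(points, minslope):
    """True when the fitted slope over the period exceeds minslope."""
    return regression_slope(points) > float(minslope)


# Hypothetical median queue lengths sampled once a minute.
queue_length = [(0, 10.0), (60, 22.0), (120, 34.0)]
print(growth_unhealthy(queue_length, "0.1"))  # queue grows by 12 per minute
```

Fitting a line over the whole period makes the check less sensitive to a single spike than comparing the first and last samples would be.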

SMA monitor

"task_throughput": {
    "monitorVersion": "1",
    "model": "sma",
    "parameters": {
        "metric": "task_consume_rate",
        "period": "15m",
        "minval": "1000",
        "window": "15m",
        "confidence": "1",
        "severity": {
            "warning": {
                "numsigma": "3"
            }
        },
        "summary": "task throughput is anomalous",
        "description": "{{$labels.es_workload}} has unusual task throughput"
    }
}

SMA monitors add these fields:

  • parameters.window: time window for calculating health, used together with the common parameters.confidence
  • parameters.period: window over which the simple moving average is calculated
  • parameters.minval: minimum deviation from the simple moving average required before the monitor is considered unhealthy
  • parameters.severity.warning.numsigma: the monitor is considered unhealthy if the metric value deviates from the simple moving average over the period by more than numsigma standard deviations

Note that numsigma follows the same syntax inside the critical severity block.
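
One way to read the period, numsigma and minval parameters together is sketched below. This is an interpretation of the field descriptions, not Lightbend Console's implementation: sma_unhealthy is a hypothetical function, the use of the population standard deviation is an assumption, and so is requiring both the numsigma and the minval conditions to hold.

```python
from statistics import fmean, pstdev


def sma_unhealthy(history, value, numsigma, minval):
    """Compare the latest value against the moving average of `history`.

    history: metric values observed over the period.
    The sample counts as unhealthy when its distance from the average is
    more than numsigma standard deviations AND at least minval.
    """
    if len(history) < 2:
        return False
    average = fmean(history)
    deviation = abs(value - average)
    return deviation > float(numsigma) * pstdev(history) and deviation >= float(minval)


# Hypothetical task consume rates over the last 15 minutes.
rates = [5000.0, 5100.0, 4900.0, 5000.0]
print(sma_unhealthy(rates, 7000.0, "3", "1000"))  # far from the average
print(sma_unhealthy(rates, 5500.0, "3", "1000"))  # beyond 3 sigma, but below minval
```

The minval floor keeps a very stable metric, whose standard deviation is close to zero, from being flagged for tiny absolute changes.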

Creating ConfigMap & Configuring Lightbend Console

Once the default-monitors.json and static-rules.yml files are modified, create a Kubernetes ConfigMap from the directory that contains them:

kubectl -n lightbend create configmap my-default-monitors-config --from-file=default-monitors-config/ --dry-run -o yaml | kubectl apply -f -

Now tell Lightbend Console to use the newly created ConfigMap for default monitors and static rules by setting the consoleAPI.defaultMonitorsConfigMap and consoleAPI.staticRulesConfigMap values in your values.yaml:

consoleAPI:
  defaultMonitorsConfigMap: my-default-monitors-config
  staticRulesConfigMap: my-default-monitors-config

Then install Lightbend Console using lbc.py, as described in the installation guide.