Monitoring Architecture

This section describes the architecture of the monitoring system, including metric discovery and monitor transformations. It is intended for:

  • Operators who want to understand how monitoring integrates with their existing systems.
  • Power users who would like to use the underlying Prometheus metrics directly.

Component Architecture

Frontend

The user interacts with nginx, which serves the single-page application (SPA). The SPA is the Console UI and runs in the user’s browser.

nginx acts as a reverse proxy, providing a convenient single point of entry. Users are able to view the UIs of the components in the diagram above via nginx.

The SPA interacts with Prometheus and the Monitor API (also via the nginx proxy).

Backend

The Monitor API provides a CRUD interface for managing the Console monitors. It also manages Prometheus by translating monitors directly into Prometheus configuration. The monitors are stored in a git repository on a persistent disk. The Monitor API is not intended to be used directly, and its API may change in the future.

Prometheus is the main monitoring and metrics component. It pulls in data from the metric endpoints of user applications and from metric exporters such as kube-state-metrics. Prometheus has built-in service discovery mechanisms for finding these endpoints.

Alertmanager, Grafana, and kube-state-metrics are installed by Console to provide a complete user experience.

Data Flow

The flow of metrics is described below.

(1) Service Discovery

Prometheus scrapes applications for metrics. It discovers the endpoints to use via Kubernetes service discovery.

(2) relabel_config

As part of scraping, Prometheus uses a relabel_config to label the incoming metrics. It adds labels that Console depends on, such as es_workload.
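To make the relabeling step concrete, here is a minimal Python sketch that emulates a Prometheus relabel rule with the "replace" action. It is an illustration, not Console's actual configuration: the source label `__meta_kubernetes_pod_label_app` and the workload value are assumptions, since this section does not say which discovered label es_workload is derived from. The sketch also skips some details of real relabeling, such as the handling of empty results.

```python
import re


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement=r"\1", separator=";"):
    """Emulate a relabel rule with action 'replace' (simplified).

    Prometheus writes the replacement as "$1"; Python's re module
    uses "\\1" for the same capture group.
    """
    # Join the values of the source labels with the separator.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # The regex must match the whole joined string.
    match = re.fullmatch(regex, value)
    result = dict(labels)
    if match is not None:
        result[target_label] = match.expand(replacement)
    return result


# Hypothetical labels attached to a target by Kubernetes service discovery.
discovered = {
    "__address__": "10.0.0.7:8080",
    "__meta_kubernetes_pod_label_app": "checkout",
}
scraped = relabel_replace(
    discovered, ["__meta_kubernetes_pod_label_app"], "es_workload"
)
print(scraped["es_workload"])  # checkout
```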

(3) Prometheus

The scraped metrics are stored in Prometheus as time series. Each time series is uniquely identified by its metric name and its set of labels.
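The identity rule can be sketched in a few lines of Python. This is a toy in-memory store, not how Prometheus stores data; the metric name and label values are made up for illustration. Two samples land in the same series only if the metric name and every label pair match, regardless of label order.

```python
def series_key(name, labels):
    """Identify a series by its metric name plus its full label set."""
    return (name, frozenset(labels.items()))


store = {}


def append(name, labels, timestamp, value):
    """Add a sample to the series identified by name and labels."""
    store.setdefault(series_key(name, labels), []).append((timestamp, value))


# Same labels in a different order: still the same series.
append("http_requests_total", {"es_workload": "checkout", "code": "200"}, 0, 10.0)
append("http_requests_total", {"code": "200", "es_workload": "checkout"}, 15, 12.0)
# One label value differs: a separate series.
append("http_requests_total", {"es_workload": "checkout", "code": "500"}, 0, 1.0)

print(len(store))  # 2
```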

The Monitor API adds recording rules to Prometheus for each monitor. The recording rules generate new time series in Prometheus. These are transformations of the underlying metric: first a model transform, then a health transform.

(4) Model Transform

The model transform generates a boolean time series which indicates if there is a statistically significant change. For example, a threshold monitor will generate true when the metric value exceeds a bound at any point in time.
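As a rough illustration of the threshold case, the following Python sketch maps a list of samples to a boolean series. The function name, the sample data, and the bound of 0.5 are assumptions for the example; in Console the transform runs as a Prometheus recording rule rather than as Python code.

```python
def model_transform(samples, upper_bound):
    """Map (timestamp, value) samples to (timestamp, exceeded) pairs."""
    return [(ts, value > upper_bound) for ts, value in samples]


# Hypothetical latency samples in seconds, one per minute.
latency = [(0, 0.2), (60, 0.7), (120, 0.4)]
model = model_transform(latency, upper_bound=0.5)
print(model)  # [(0, False), (60, True), (120, False)]
```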

(5) Health Transform

The health transform generates another boolean time series, which indicates whether the metric is healthy or not over some time window. This depends on the model transform created in (4). For example, a threshold monitor may become unhealthy if it exceeds a bound over a 10 minute window.
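One possible reading of the windowed check is sketched below in Python: the series is marked unhealthy only when every model sample in a full trailing window is true. The exact rule Console applies is not specified here, so the "all samples in the window" criterion, the function name, and the sample data are assumptions.

```python
def health_transform(model_series, window_seconds):
    """Map (timestamp, exceeded) pairs to (timestamp, healthy) pairs.

    A point is unhealthy only if the data covers a full window and
    every model sample inside that window is True.
    """
    if not model_series:
        return []
    first_ts = model_series[0][0]
    health = []
    for ts, _ in model_series:
        start = ts - window_seconds
        window = [flag for t, flag in model_series if start <= t <= ts]
        covered = first_ts <= start
        health.append((ts, not (covered and all(window))))
    return health


# Hypothetical model series sampled every 5 minutes, with a 10 minute window.
model = [(0, False), (300, True), (600, True), (900, True), (1200, False)]
health = health_transform(model, window_seconds=600)
print(health)
# [(0, True), (300, True), (600, True), (900, False), (1200, True)]
```

In this example the single unhealthy point is at 900 seconds, the first time the bound has been exceeded for a whole 10 minute window.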

Alerts are generated directly from the health time series. When the health time series becomes unhealthy, an alert will be generated. If Alertmanager is configured, the alerts will be sent immediately to Alertmanager. Alertmanager itself can have rules to aggregate, delay, and suppress alerts. In addition, Alertmanager handles routing of alerts to external integrations, such as PagerDuty and Slack.
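The step from health series to alert can be pictured as edge detection. The Python sketch below reports the timestamps at which a health series switches from healthy to unhealthy. It is a simplification with hypothetical names and data: in practice Prometheus evaluates alerting rules and Alertmanager decides what is actually delivered.

```python
def alerts_from_health(health_series):
    """Return timestamps where the series turns from healthy to unhealthy."""
    alerts = []
    previously_healthy = True
    for ts, healthy in health_series:
        if previously_healthy and not healthy:
            alerts.append(ts)
        previously_healthy = healthy
    return alerts


# Hypothetical health series: True means healthy.
health = [(0, True), (300, True), (600, False),
          (900, False), (1200, True), (1500, False)]
print(alerts_from_health(health))  # [600, 1500]
```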

Further notes

The transforms (4 and 5) are generated from a single monitor definition, which is created either in the Console UI or in the default monitors file passed to the Monitor API at startup.
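To show how one definition could yield both transforms, here is a hedged Python sketch that turns a monitor into a Prometheus rule group with two recording rules. The monitor fields, the `monitor:<name>:model` and `monitor:<name>:health` series names, and the PromQL expressions are all hypothetical; the real format used by the Monitor API is not documented here. Only the rule group layout (`name`, `rules`, `record`, `expr`) follows the standard Prometheus rule file format.

```python
import json


def rules_for_monitor(monitor):
    """Build a rule group with a model rule and a health rule (illustrative)."""
    name = monitor["name"]
    model_series = f"monitor:{name}:model"    # hypothetical naming scheme
    health_series = f"monitor:{name}:health"  # hypothetical naming scheme
    return {
        "name": f"monitor-{name}",
        "rules": [
            # Model transform: 1 while the metric exceeds the bound, else 0.
            {"record": model_series,
             "expr": f"{monitor['metric']} > bool {monitor['threshold']}"},
            # Health transform: 1 (healthy) unless the bound was exceeded
            # for the whole window.
            {"record": health_series,
             "expr": f"min_over_time({model_series}[{monitor['window']}]) < bool 1"},
        ],
    }


# Hypothetical monitor definition.
monitor = {
    "name": "latency_high",
    "metric": "http_request_duration_seconds",
    "threshold": 0.5,
    "window": "10m",
}
group = rules_for_monitor(monitor)
# JSON is a subset of YAML, so this output is also a valid YAML document.
print(json.dumps({"groups": [group]}, indent=2))
```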

Both (4) and (5) produce time series in Prometheus that can be queried and graphed in Grafana.
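For power users who want to read these series directly, the sketch below builds a request URL for the Prometheus HTTP API instant query endpoint (`/api/v1/query`). The base URL and the series name `monitor:latency_high:health` are assumptions; substitute the address at which Prometheus is reachable in your installation and a series name that actually exists there.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address and series name.
url = instant_query_url(
    "http://localhost:9090",
    'monitor:latency_high:health{es_workload="checkout"}',
)
print(url)
```

The same PromQL expression can be pasted into a Grafana panel that uses Prometheus as its data source.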