Kubernetes

kube_pod_not_ready

This monitor alerts when a pod’s readiness probe fails.

Readiness probes are application specific. Kubernetes Services use them to decide whether a pod should receive new connections. They also indicate whether an application is healthy.

Failure Examples

  • Application is overloaded, so its readiness probe fails.
  • Application is crashing, so its readiness probe fails.
  • Application is restarting and stuck in CrashLoopBackOff, so its readiness probe has not been executed.

Suggested Actions

Look at the pod’s logs and the Kubernetes events to determine why the readiness probe is failing.

Implementation

Model

The base model is a simple expression:

kube_pod_ready{es_monitor_type=""} < bool 1

Where kube_pod_ready is a recording rule:

kube_pod_status_ready{condition="true"}
unless on(pod) avg by(pod) (kube_pod_container_status_terminated) == 1
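
The filtering done by the recording rule can be sketched in Python. This is only an illustration of the set logic, not how Prometheus evaluates the rule: the function names and the dictionary inputs are hypothetical, and labels other than pod are ignored. It assumes the comparison with 1 applies to the per-pod average, so that a pod is excluded only when all of its containers report as terminated.

```python
def kube_pod_ready(ready, terminated):
    """Sketch of the recording rule.

    ready:      dict of pod name -> readiness value (0 or 1)
    terminated: dict of pod name -> list of per-container terminated flags (0 or 1)
    Both input shapes are assumptions made for this example.
    """
    result = {}
    for pod, value in ready.items():
        flags = terminated.get(pod, [])
        # avg by(pod) (...) == 1 holds only when every container is terminated.
        fully_terminated = bool(flags) and sum(flags) / len(flags) == 1
        # "unless on(pod)" drops the readiness series for those pods.
        if not fully_terminated:
            result[pod] = value
    return result


def not_ready_pods(ready_series):
    """Sketch of the base model: pods whose readiness value is below 1."""
    return sorted(pod for pod, value in ready_series.items() if value < 1)


ready = {"web": 1, "api": 0, "job": 0}
terminated = {"web": [0, 0], "api": [0, 1], "job": [1, 1]}

series = kube_pod_ready(ready, terminated)
print(series)                  # the "job" pod is excluded
print(not_ready_pods(series))  # only "api" is reported as not ready
```

In this sketch a pod whose containers have all terminated, such as a completed job, never reaches the base model, so it cannot raise a not-ready alert.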

Tuning

The main parameters to tune are the window length and occurrence. The monitor is currently quite sensitive because it is intended to be a live representation of a pod’s readiness. Increasing the window length and/or confidence makes it less sensitive.

kube_container_restarting

This monitor alerts when a Kubernetes container is restarting.

Failure Examples

  • JVM application is killed by an out-of-memory error every 30 seconds. It restarts and resumes running until the next time it runs out of memory and is killed.
  • Bug in application causes it to crash every 30 seconds.

Suggested Actions

Check the logs of your application for reasons for crashes.

Implementation

This is a threshold monitor based on a metric that captures the restart rate of Kubernetes containers. The rate is calculated over a 1-minute window. With the default settings, the health model requires the rate to be non-zero for at least 25% of the time over a 15-minute window. This handles pods that go into CrashLoopBackOff, which has a backoff time of up to 5 minutes.
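
The occurrence check described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the monitor's actual implementation: the function name, the min_fraction parameter, and the idea of one evenly spaced sample per minute are all hypothetical.

```python
def is_unhealthy(samples, min_fraction=0.25):
    """Return True when enough samples in the window show a non-zero restart rate.

    samples:      restart-rate values taken at even intervals across the window
                  (for example, 15 one-minute samples for a 15-minute window)
    min_fraction: share of samples that must be non-zero (0.25 mirrors the 25% default)
    """
    if not samples:
        return False
    non_zero = sum(1 for value in samples if value > 0)
    return non_zero / len(samples) >= min_fraction


# Hypothetical 15-minute windows, one sample per minute.
four_restarts = [0.02, 0, 0, 0.02, 0, 0, 0, 0.02, 0, 0, 0, 0.02, 0, 0, 0]
three_restarts = [0.02, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0, 0]

print(is_unhealthy(four_restarts))                     # 4 of 15 samples are non-zero
print(is_unhealthy(three_restarts))                    # 3 of 15 samples are non-zero
print(is_unhealthy(four_restarts, min_fraction=0.5))   # stricter occurrence setting
```

Raising min_fraction or lengthening the window means more non-zero samples are needed before the check fires, which is the sense in which those settings make a monitor less sensitive.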

Model

The base model is a simple expression:

kube_container_restarts_rate{es_monitor_type=""} > bool 0

Where kube_container_restarts_rate is a metric generated by a custom recording rule that combines two base metrics:

rate(kube_pod_container_status_restarts_total[1m]) >= 0
unless on(pod) avg by(pod) (kube_pod_container_status_terminated) == 1

The rate window can be changed in static-rules.json.