Akka

Akka metrics come from Lightbend Telemetry (Cinnamon) and describe the performance of actors and HTTP endpoints.

akka_inbox_queue_time

Akka applications are made up of actors, and each actor has a queue, called a “mailbox”, for incoming messages from other actors. The akka_inbox_queue_time monitor alerts if the time that messages spend in the queue is growing. An increase in queuing time indicates that the actor is unable to keep up with the number of messages being received.

Failure Examples

  • There are more incoming messages than the actor can keep up with. This leads to growing mailbox size and growing mailbox time.
  • Processing time for an average message has increased, so messages spend longer waiting in the queue.

Suggested Actions

Look at the akka_processing_time monitor or the underlying telemetry metric akka_actor_processing_time_ns. If it also shows growth, then each message is taking more time to process. This could be due to increased work complexity or over-utilization of CPU resources on the node. If message processing time hasn’t increased, then more messages are coming in. If that is expected, consider distributing work among more actors or increasing available CPU resources.
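The decision described above can be sketched as a small function. This is only an illustration of the reasoning, not part of the monitor: the function name, the (earlier, later) sample pairs, and the 10% tolerance are all assumptions made for the example.

```python
def diagnose_queue_growth(mailbox_p99_ns, processing_p99_ns, tolerance=0.10):
    """Classify why mailbox time grew, given (earlier, later) p99 samples in ns.

    Hypothetical helper: the tolerance and the two-sample comparison are
    assumptions for illustration, not how the monitor itself works.
    """
    def grew(earlier, later):
        # Treat a relative increase above `tolerance` as growth.
        return earlier > 0 and (later - earlier) / earlier > tolerance

    if not grew(*mailbox_p99_ns):
        return "mailbox time is stable"
    if grew(*processing_p99_ns):
        # Each message takes longer: look at work complexity and CPU.
        return "slower processing: check work complexity and CPU utilization"
    # Processing time is flat, so the queue grows because more messages arrive.
    return "more incoming messages: add actors or CPU if the load is expected"
```

For example, diagnose_queue_growth((1000, 2000), (500, 510)) points at more incoming messages, because mailbox time doubled while processing time stayed nearly flat.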

Implementation

This is a growth monitor based on the 99th percentile of mailbox time, as provided by the Lightbend Telemetry metric akka_actor_mailbox_time_ns.

Note that the underlying metric is a Prometheus Summary with a 10-minute sliding window, so the percentile is computed over all messages in the last 10 minutes.
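To see what the sliding window means in practice, here is a minimal sketch of a percentile computed over the last 10 minutes of observations. It keeps every sample and uses a nearest-rank quantile, which is an assumption chosen for clarity; real Summary implementations typically use streaming approximations, and the class name is hypothetical.

```python
import math
from collections import deque


class SlidingWindowQuantile:
    """Exact quantile over observations from the last `window_s` seconds."""

    def __init__(self, window_s=600.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, value) pairs in arrival order

    def observe(self, timestamp_s, value):
        self.samples.append((timestamp_s, value))

    def quantile(self, q, now_s):
        # Drop observations that have aged out of the window.
        while self.samples and self.samples[0][0] <= now_s - self.window_s:
            self.samples.popleft()
        if not self.samples:
            return None
        ordered = sorted(value for _, value in self.samples)
        # Nearest rank: smallest value with at least a fraction q at or below it.
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]
```

One consequence is that a single slow observation can keep the 99th percentile elevated until it ages out of the window, so the monitor may stay raised for up to 10 minutes after the underlying problem is resolved.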

akka_processing_time

Akka actors take one message at a time from their mailbox and process that message. The akka_processing_time_growth monitor warns if that processing time is growing.

Failure Examples

  • The actor needs to communicate with an external service in order to process the message, and that service is down. This leads to timeout errors and increased processing time.
  • The amount of work for each message has increased (e.g., more users in the database leads to more expensive join operations).
  • Too many containers running on the node leads to over-subscription of CPU resources and increased processing time.

Suggested Actions

  • Check for errors in all services that the actor needs to contact in order to process the message.
  • Make sure enough CPU resources are allocated for pods that run actor systems.

Implementation

This is a growth monitor based on the 99th percentile of processing time, as provided by the Lightbend Telemetry metric akka_actor_processing_time_ns.

Note that the underlying metric is a Prometheus Summary with a 10-minute sliding window, so the percentile is computed over all messages in the last 10 minutes.
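The text does not say how a growth monitor decides that a value is growing. One plausible approach, shown here purely as an assumption, is to fit a least-squares line through recent samples of the 99th percentile and compare its slope to a minimum. The function names and the min_slope parameter are hypothetical.

```python
def growth_slope(samples):
    """Least-squares slope of (timestamp_s, value) samples, in value units per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        # All samples share one timestamp, so no trend can be estimated.
        return 0.0
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


def is_growing(samples, min_slope):
    # Hypothetical check: the real monitor's criterion is not described here.
    return growth_slope(samples) > min_slope
```

A slope-based check reacts to a sustained trend rather than to a single high sample, which fits a metric that is already smoothed by a 10-minute window.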

akka_http_server_response_time

This monitor warns if the 99th percentile of the time taken to service requests is increasing, from the point of view of the akka-http server.

Failure Examples

  • Database reads are going slower, causing request processing to slow down.
  • The server is overloaded and unable to service the increasing number of requests.

Suggested Actions

  • Investigate the telemetry akka-http metrics to see what’s going on. Check if the number of requests is increasing.
  • Investigate CPU and memory usage to see if anything strange is happening.

Implementation

This is a growth rate monitor based on the measure of the time it takes for the server to respond as provided by Lightbend Telemetry metric akka_http_http_server_response_time_ns.

Note that the underlying metric is a Prometheus Summary with a 10 minute sliding window, so the percentile will be over all requests in the last 10 minutes.

akka_http_client_response_time

This monitor warns if the 99th percentile of time to get a response from an external service is increasing, from the point of view of the akka-http client.

Failure Examples

  • External service is slowing down.
  • An increase in user traffic is causing more load and more requests to be generated.

Suggested Actions

  • Inspect the external service that the client is sending requests to.
  • Investigate the telemetry akka-http metrics to see what’s going on. Check if the number of requests is increasing.

Implementation

This is a growth rate monitor based on the measure of the client wait time as provided by Lightbend Telemetry metric akka_http_http_client_http_client_service_response_time_ns.

Note that the underlying metric is a Prometheus Summary with a 10 minute sliding window, so the percentile will be over all requests in the last 10 minutes.

akka_http_server_5xx

This monitor warns when the akka-http server generates 5xx responses (server errors, such as 500 Internal Server Error).

Failure Examples

  • A bug in the server causes requests to fail.
  • Timeouts are being hit somewhere, causing requests to fail unexpectedly.
  • The server is throwing uncaught exceptions on some user input.
  • The server is returning 5xx as it is shutting down or starting up.

Suggested Actions

  • Investigate the server logs to see what’s going on.
  • Check the response body of the 5xx for more information.
  • Ensure your application has readiness probes set up so requests are not sent during startup or shutdown.

Implementation

This is a threshold monitor based on the rate of 5xx errors per second as provided by Lightbend Telemetry metric akka_http_http_server_responses_5xx.
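As a rough sketch of what a threshold on a per-second rate involves, the example below derives a rate from periodic counter readings and compares it to a limit. It assumes the metric behaves as a monotonically increasing counter that may reset on restart; the function names, sampling scheme, and threshold value are hypothetical and not taken from the monitor's configuration.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over (timestamp_s, count) samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. process restart); count from zero.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def exceeds_threshold(samples, threshold_per_s):
    # Hypothetical threshold check on the computed 5xx rate.
    return per_second_rate(samples) > threshold_per_s
```

With readings of 10, 16, and 40 errors taken at 0, 30, and 60 seconds, the rate is 30 errors over 60 seconds, or 0.5 per second, so a threshold of 0.1 per second would be exceeded.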