The importance of observability

Over the past eight years, Lightbend has seen first-hand the importance of observability when developing and managing reactive and streaming applications. Distributed systems, by nature, are unpredictable. Despite our best efforts, failures and performance bottlenecks in such systems are inevitable—and can be difficult to isolate. In this complex environment, having deep visibility into the behavior of your applications is critical for software development teams.

The Costs of Deferred Observability

Without deep observability, it is natural to make assumptions about how a production system behaves, including where performance bottlenecks or failures might arise. When failures do occur, we are often in the dark about both their cause and the effect of potential fixes. This leads to wasted time and effort, jumping from one theory to the next and from one change to the next without fully understanding the impact on the system. If customers are affected, the cost of this guesswork to the business can escalate quickly. Historically, this has been the point at which Lightbend often receives an urgent call for help.

In production, while Kubernetes can help a system recover from some failures, many scenarios can still cause it to run sub-optimally or to fail continuously. Even when service availability is maintained, performance bottlenecks can trigger premature auto-scaling and, with it, excessive use of costly cloud computing resources. Indeed, there are cases where the first sign of failure in a non-observable system is the cloud computing bill.

Day 1 Observability

Observability is about bringing visibility into a system: turning the lights on to see and understand the state of each component, with context to aid debugging and performance tuning. While traditional monitoring systems may have been the realm of operations engineers, today’s cloud-native applications must be developed with observability at the core of their design. Today, observability is a day 1 developer concern.

Building observable systems requires understanding the many ways in which they can fail. While it’s tempting to “just measure everything”, doing so produces an excess of information that yields no further insight into the system’s behavior while driving up infrastructure costs. Knowing what to measure and how to measure it is critical. Repeated iterations of testing and observing are essential to learning how failures actually occur.

While logs, metrics and traces are critical, on their own they are not enough to provide visibility into a system. Observability requires combining this data with rich context to create an understanding suited to debugging and performance tuning. On day 1 with a Lightbend subscription, you are able to see the impact of your changes on system performance using Lightbend Telemetry and Grafana templates:

Lightbend Telemetry provides deep visibility in the form of events, metrics and distributed tracing for components such as Akka Actors, Akka HTTP, Akka Streams, Akka Cluster and Akka Persistence. This helps to answer questions such as: “What is the message mailbox time for an actor?”, “What is the distribution of sharded entities in my Akka Cluster?”, “What part of my Akka Stream has the worst latency?” and much more. A custom API is included for instrumenting domain-specific KPIs, such as the number of orders processed.

Lightbend’s observability features do not stand on their own. They work together with your in-house monitoring setup to provide deep visibility into the behavior of your Lightbend applications. A wide range of integrations is offered, so that applications built with Lightbend technology can be monitored with your tooling of choice. These integrations include Prometheus, Elasticsearch, Grafana, Datadog, New Relic, Kibana, SLF4J, StatsD and JMX, as well as OpenTracing integrations for Zipkin and Jaeger.
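To make the idea of a domain-specific KPI such as “number of orders processed” more concrete, the following is a minimal sketch in plain Java. It does not use the Lightbend Telemetry API; the class `OrderMetrics` and its method names are hypothetical and rely only on the JDK. It simply illustrates the kind of business-level counter that an application might expose alongside its infrastructure metrics.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper for illustration only; NOT part of Lightbend Telemetry.
// Tracks a domain-specific KPI: how many orders were processed or failed.
final class OrderMetrics {
    // LongAdder keeps increments cheap when many threads update concurrently.
    private final LongAdder ordersProcessed = new LongAdder();
    private final LongAdder ordersFailed = new LongAdder();

    // Call once for every order that completes successfully.
    void recordProcessed() {
        ordersProcessed.increment();
    }

    // Call once for every order that fails.
    void recordFailed() {
        ordersFailed.increment();
    }

    long processed() {
        return ordersProcessed.sum();
    }

    long failed() {
        return ordersFailed.sum();
    }

    // Fraction of recorded orders that failed; 0.0 when nothing was recorded.
    double failureRatio() {
        long total = processed() + failed();
        return total == 0 ? 0.0 : (double) failed() / total;
    }
}
```

In a real deployment, values like these would typically be published through whichever reporting backend the team already uses, rather than read directly from the object.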