Putting it all together

Up to this point, we have discussed the challenges of monitoring at scale, reviewed the key concepts and terms, and taken a look at some of the standard tools available. With this understanding, we can now walk through what a potential implementation might look like. One thing we would like to reiterate is that monitoring, especially at scale, has many moving parts and is driven in large part by context. Unfortunately, there is no one-size-fits-all solution. As with all distributed systems, tuning is more often than not required, and monitoring is no different.

Additionally, the following recommendations for production monitoring are in no way meant to be an exhaustive list of solutions; rather, they should be considered one approach that can be modified to meet a given use case by swapping out one tool for another.

Introduction

Let’s start by laying out our expectations for the monitoring system and identifying our example use case.

Expectations

Following is a list of expectations for our monitoring-at-scale implementation:

  • We will base the core of our solution on events, metrics, logs, and traces.
  • We will favor push-based “whitebox” monitoring over pull-based “blackbox” monitoring.
  • Visualization will be a key component, allowing us to view the state of our system.
  • We will provide contextual notifications so as not to induce alert fatigue.
  • We will tune Lightbend monitoring for the particular use case (context) to reduce the potential overhead created by monitoring (the observer effect).

Use case

Our use case will focus on an Akka-based system that includes a front-end service which routes requests to a backend that persists information to a database. The front-end service will implement the router pattern to handle multiple concurrent requests, as will the backend that performs persistence. The application will also contain a logic module that spins up 1-to-n short-lived worker actors, which perform the calculation required to validate a request before it is persisted. Following is a diagram of our use case:

Use Case
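
To make the topology concrete before we discuss tooling, here is a minimal Akka sketch of the hierarchy described above. All names, pool sizes, and the string-based message protocol are illustrative assumptions rather than the actual application; validation and persistence are reduced to placeholders.

    import akka.actor.{Actor, ActorRef, ActorSystem, PoisonPill, Props}
    import akka.routing.RoundRobinPool

    // Short-lived worker: performs the validation "calculation" and stops itself.
    class Worker(persistence: ActorRef) extends Actor {
      def receive: Receive = {
        case request: String =>
          persistence ! s"validated: $request" // placeholder calculation
          self ! PoisonPill
      }
    }

    // Logic module: spins up a fresh worker (1-to-n of them) per request.
    class Logic(persistence: ActorRef) extends Actor {
      def receive: Receive = {
        case request: String =>
          context.actorOf(Props(classOf[Worker], persistence)) forward request
      }
    }

    // Backend routee: would persist validated requests to the database.
    class Persistence extends Actor {
      def receive: Receive = {
        case validated: String => // database write goes here
      }
    }

    // Front-end routee: routes incoming requests on to the logic module.
    class FrontEnd(logic: ActorRef) extends Actor {
      def receive: Receive = {
        case request: String => logic forward request
      }
    }

    object UseCaseApp extends App {
      val system      = ActorSystem("monitoring-use-case")
      val persistence = system.actorOf(RoundRobinPool(5).props(Props[Persistence]), "persistence")
      val logic       = system.actorOf(Props(classOf[Logic], persistence), "logic")
      val frontEnd    = system.actorOf(RoundRobinPool(5).props(Props(classOf[FrontEnd], logic)), "front-end")

      frontEnd ! "example-request"
    }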

The tools used

  • Instrumentation: Our “whitebox,” push-based architecture will focus on instrumenting the application with the Lightbend monitoring toolkit, aggregating event and metric data locally (a build sketch follows this list).
  • Collection: For collection, we will first push to our metrics registry for local aggregation and then report to CollectD.
  • Aggregation: A distributed Apache Kafka cluster will be used for aggregation, providing scalable, durable temporary storage.
  • Storage: Elasticsearch will be our storage mechanism for events; for metrics, we will use a Cassandra TSDB cluster.
  • Analytics: Our analytics tool will be Spark. We will not cover Fast Data integration in detail, as it is beyond the scope of this document; as mentioned above, an excellent resource is Fast Data Architectures for Streaming Applications by Lightbend’s fast data expert, Dean Wampler, PhD.
  • Notifications: We will use Riemann as our notification service, specifically its email integration. As with analytics, we will not do a deep dive into notifications, as there are many excellent resources available online, but we will show how to integrate Riemann with the proposed monitoring solution.
  • Visualization: For visualizations, we will rely upon Grafana as our dashboard for metrics and events, and on Kibana for data exploration and discovery.
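
To give a sense of how the instrumentation mentioned in the first bullet is wired in, the following build fragments sketch an sbt setup for Lightbend monitoring (Cinnamon). The plugin version is a placeholder, the module names follow the sbt-cinnamon plugin as documented by Lightbend, and Cinnamon itself is pulled from Lightbend’s commercial repository, so your credentials and exact coordinates may differ.

    // project/plugins.sbt -- the Cinnamon sbt plugin (version is a placeholder)
    addSbtPlugin("com.lightbend.cinnamon" % "sbt-cinnamon" % "<version>")

    // build.sbt -- enable the Cinnamon Java agent and the Akka instrumentation
    enablePlugins(Cinnamon)

    // Weave the agent into forked runs so instrumentation happens at load time.
    cinnamon in run := true

    libraryDependencies ++= Seq(
      Cinnamon.library.cinnamonAkka,      // Akka actor, router, and dispatcher metrics
      Cinnamon.library.cinnamonCHMetrics  // local Coda Hale metrics registry and reporters
    )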

Now that we have a sense of our expectations, use case, and tools, let’s dig into what our monitoring architecture looks like, how we configure Lightbend monitoring for the given context, and how the selected tools integrate to form a complete solution.

Implemented architecture

Let’s start by taking a look at a diagram of our proposed monitoring solution for production.

Solution

There are many parts in the figure above, so we have numbered the elements to ease the discussion. As mentioned above, some of the features, such as analytics and notifications, while part of the solution, will not be discussed in detail because they are beyond the scope of this document. Lastly, where appropriate, we will link to external resources for a more in-depth view of a particular element’s configuration and capabilities. Let’s walk through the architecture and dive into the details:

  1. Microservices are considered an extension of service-oriented architectures (SOA) and are typical of distributed systems. More often than not, a microservice focuses on a particular function or responsibility within a larger system. In our monitoring example, the microservice builds upon an Akka cluster composed of 1-to-n nodes and instrumented by way of Lightbend monitoring (a minimal cluster configuration sketch follows this list). See Akka Cluster Usage for more information on Akka Cluster.
  2. In a zoomed view of a given node within our Akka microservice cluster, we see several actors. Actors one and two represent “top-level” actors, with actor one having two types of children. The first category accounts for the 1-to-n temporary workers, and the second a typical routee. You will also notice the metrics registry and CollectD for collection. These two components are part of the Lightbend monitoring platform for capturing, locally aggregating, and pushing metrics and events via CollectD.
  3. This section of the diagram represents the core of Lightbend monitoring. You will notice that we monitor four kinds of actors, one of them being the worker actor. The worker is short-lived and could potentially number in the millions, so we must optimize our configuration based on this context. As mentioned above, there is no free lunch: monitoring overhead, or the “observer effect,” is a real concern that must be mitigated. Therefore, for the workers, rather than reporting per instance we will report per class, in turn aggregating their metrics and events (a configuration sketch follows this list). Also, you will see the metrics registry and CollectD mentioned in section two of the diagram. The last component, which is not evident in the picture, is the Cinnamon Agent, a Java agent that weaves instrumentation into our frameworks at load time.
  4. Streams of data are ingested into a distributed Kafka cluster, capturing metrics and events from CollectD as well as other inputs from the system as a whole (a consumer sketch follows this list). Kafka will act as the “backbone” plumbing for aggregation and propagation of data throughout our system.
  5. As we discussed above, Riemann is an event stream processor which provides the ability to define rules based on streams of events. In our case, “events” also include metrics and logs, which provide dynamic “views” into the state and health of the system or service that we monitor. From these “views” we can decide which “events” to persist in Elasticsearch and how to respond to a potential fault by way of notification. A full review of what Riemann can do is beyond the scope of this document, but you can think of it as our event-processing rules engine (a client sketch follows this list).
  6. We have chosen Elasticsearch as our primary “warm” store for monitoring event data. Elasticsearch is a logical choice as it has a stable support structure, is clusterable, providing elasticity and scalability, and works with a plethora of plugins and other tools.
  7. We will use a Cassandra-based TSDB cluster for “warm” and “cold” storage of time-series metrics data (a schema sketch follows this list).
  8. If you have Spark and a persistent store, like HDFS and/or a database, you can still do batch-mode processing and interactive analytics. Hence, the architecture is flexible enough to support traditional analysis scenarios too.
  9. The mini-batch model of Spark is ideal when longer latencies are tolerable and the additional window of time is valuable for more expensive calculations, such as training machine learning models using Spark’s MLlib or ML libraries, or third-party libraries (a streaming sketch follows this list). Spark Streaming is evolving away from being limited only to mini-batch processing and will eventually support low-latency streaming too, although this transition will take some time. Efforts are also underway to implement Spark Streaming support for running Beam dataflows.
  10. Grafana will be used as the primary visualization tool for metrics and events.
  11. We will use Kibana to search, view, and interact with data stored in Elasticsearch indices. This will allow us to quickly perform advanced data analysis and to visualize our data in a variety of charts, tables, and maps. The key capability Kibana provides is the ability to query our Elasticsearch data in a dynamic fashion.
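
To make several of the numbered elements above more concrete, the sketches below show, under clearly stated assumptions, how the pieces might be wired together. They are starting points rather than definitive implementations: every host name, port, topic, and identifier that does not appear elsewhere in this document is a placeholder.

First, for element 1, a minimal Akka cluster configuration for the 1-to-n nodes of the microservice. The system name matches the earlier use-case sketch; the seed-node addresses are illustrative, and in a real deployment this block would live in application.conf rather than being parsed inline.

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    object ClusterNode extends App {
      // Parsed inline only to keep the sketch self-contained.
      val config = ConfigFactory.parseString(
        """
        akka {
          actor.provider = "cluster"
          remote.netty.tcp {
            hostname = "127.0.0.1"
            port     = 2551
          }
          cluster.seed-nodes = [
            "akka.tcp://monitoring-use-case@127.0.0.1:2551",
            "akka.tcp://monitoring-use-case@127.0.0.1:2552"
          ]
        }
        """).withFallback(ConfigFactory.load())

      // The system name must match the name used in the seed-node addresses.
      val system = ActorSystem("monitoring-use-case", config)
    }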
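
For element 3, the following sketch shows how Lightbend monitoring (Cinnamon) could be told to aggregate the short-lived workers per class while still reporting the longer-lived actors per instance. The actor paths follow the hypothetical hierarchy from the use-case sketch; consult the Lightbend Telemetry documentation for the exact keys supported by your version.

    import com.typesafe.config.ConfigFactory

    object MonitoringConfig {
      // This block normally lives in application.conf alongside the Akka settings.
      val config = ConfigFactory.parseString(
        """
        cinnamon.akka {
          actors {
            # Long-lived, low-cardinality actors: report each instance.
            "/user/front-end"   { report-by = instance }
            "/user/logic"       { report-by = instance }
            "/user/persistence" { report-by = instance }

            # The 1-to-n short-lived workers: aggregate per class to limit the
            # observer effect of potentially millions of per-instance series.
            "/user/logic/*"     { report-by = class }
          }
        }
        """)
    }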
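
For element 4, a minimal consumer that reads the metric stream Kafka is aggregating, for example to feed Riemann or the storage writers downstream. The bootstrap servers, group id, and the collectd-metrics topic name are assumptions; the actual topic depends on how CollectD is configured to publish to Kafka.

    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer

    object MetricsTopicConsumer extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092")
      props.put("group.id", "monitoring-pipeline")
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(Collections.singletonList("collectd-metrics")) // topic name is an assumption

      while (true) {
        val records = consumer.poll(Duration.ofMillis(500))
        records.forEach { record =>
          // In a real pipeline, hand each metric/event off to the downstream
          // processors (Riemann, Elasticsearch, Cassandra) here.
          println(s"${record.key}: ${record.value}")
        }
      }
    }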
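
For element 5, the stream rules themselves, including the email notification, are written in Riemann’s own Clojure configuration, which we do not reproduce here. The sketch below only shows how a JVM process in the pipeline might forward an event to Riemann; it assumes the riemann-java-client, and the host, port, service name, and metric value are placeholders, so check the client version you use for the exact package and API.

    import java.util.concurrent.TimeUnit
    import io.riemann.riemann.client.RiemannClient

    object RiemannForwarder extends App {
      val client = RiemannClient.tcp("riemann-host", 5555)
      client.connect()

      // A single illustrative event; a real forwarder would translate the
      // metrics consumed from Kafka into events like this one.
      client.event()
        .service("front-end mailbox-size")
        .state("warning")
        .metric(128.0)
        .tags("akka", "monitoring")
        .send()
        .deref(5, TimeUnit.SECONDS)

      client.close()
    }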
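
For element 7, a dedicated TSDB layer running on top of Cassandra would bring its own schema; if you roll your own tables instead, the sketch below shows one common time-series layout, partitioned per metric per day with rows clustered by timestamp. The contact point, keyspace, replication settings, and column names are assumptions.

    import com.datastax.driver.core.Cluster

    object MetricsSchema extends App {
      val cluster = Cluster.builder().addContactPoint("cassandra-1").build()
      val session = cluster.connect()

      session.execute(
        """CREATE KEYSPACE IF NOT EXISTS metrics
          |WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}""".stripMargin)

      // One partition per metric per day keeps partitions bounded; clustering by
      // timestamp (descending) makes "latest values in a window" queries cheap.
      session.execute(
        """CREATE TABLE IF NOT EXISTS metrics.actor_metrics (
          |  metric_name text,
          |  day         date,
          |  ts          timestamp,
          |  node        text,
          |  value       double,
          |  PRIMARY KEY ((metric_name, day), ts, node)
          |) WITH CLUSTERING ORDER BY (ts DESC, node ASC)""".stripMargin)

      cluster.close()
    }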
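
Finally, for elements 8 and 9, a minimal Spark Streaming job that consumes the same Kafka topic in 30-second mini-batches. The topic, group id, batch interval, and the simple per-key count are placeholders for whatever analysis, such as anomaly detection or model training, the use case actually requires.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    object MetricsAnalytics {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("monitoring-analytics")
        val ssc  = new StreamingContext(conf, Seconds(30)) // mini-batch interval

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "kafka-1:9092",
          "group.id"           -> "spark-analytics",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer]
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("collectd-metrics"), kafkaParams))

        // Placeholder analysis: count events per metric key in each mini-batch.
        stream.map(record => (record.key, 1L)).reduceByKey(_ + _).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }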