At the core of any mature monitoring system lie its instrumentation components: metrics and events. Metrics and events are the sources of truth from which we derive the answers to two fundamental questions:
- What is the state of our environment?
- How is our environment performing?
Metrics are one of the “go-to” standards for any monitoring system, and they come in a variety of types. At its core, a metric is a measurement of a property of a portion of an application or system. Metrics make an observation by keeping track of the state of an object, recording temporal points of data. Each observation is a value, or a series of related values, combined with a timestamp that describes the observation; the output is commonly called time-series data.
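The idea of an observation as a value plus a timestamp can be sketched as a small data structure. This is an illustrative example only; the type name `Observation` and the metric name `page.hits` are assumptions, not any real library's schema:

```python
import time
from dataclasses import dataclass, field

# A single metric observation: a measured value plus the timestamp
# describing when it was taken. Names here are illustrative.
@dataclass
class Observation:
    name: str       # which property was measured, e.g. "page.hits"
    value: float    # the measured value
    timestamp: float = field(default_factory=time.time)

# A time series is simply an ordered list of related observations.
series = [
    Observation("page.hits", 120, timestamp=1000.0),
    Observation("page.hits", 135, timestamp=1060.0),
    Observation("page.hits", 128, timestamp=1120.0),
]
```

Stored this way, each data point carries everything needed to reconstruct the series later: what was measured, what the value was, and when.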
The canonical example of metrics gathering is website hits, where we regularly collect data points such as the number of times someone visits a particular page and the location or source of the visitor for a given site.
A single metric in and of itself is often not that useful, but when visualized over time, especially in conjunction with a mathematical transformation, metrics can give great insight into a system. Following are some common transformations for metrics:
- Standard deviation
- Rates of change
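Both transformations listed above can be computed directly from a series of (timestamp, value) pairs. The sample values below are hypothetical, chosen only to illustrate the arithmetic:

```python
import statistics

# A small metric series of (timestamp_seconds, value) pairs.
samples = [(0, 100.0), (60, 130.0), (120, 190.0), (180, 220.0)]
values = [v for _, v in samples]

# Standard deviation: how widely the observations spread around the mean.
std_dev = statistics.stdev(values)

# Rate of change: difference in value per unit time between
# consecutive observations.
rates = [
    (v2 - v1) / (t2 - t1)
    for (t1, v1), (t2, v2) in zip(samples, samples[1:])
]
print(round(std_dev, 2), rates)  # spread of the series, then per-second rates
```

A dashboard typically plots the transformed series (here, `rates`) rather than the raw values, since the shape of the change is often more informative than the absolute numbers.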
An event is a historical record that something has happened, such as an action or behavior of a system. For example, in Lightbend Monitoring we have several different types of events, such as Dead Letter events, Unhandled Message events, Circuit Breaker Open events, and Mailbox Size Limit events, to name a few. Each of these events triggers as a result of some behavior that occurs within the application or system.
Events are very useful for monitoring as they provide a behavioral footprint of how the system is performing. They are also commonly stored in text format, which, if indexed properly, provides a powerful search mechanism for analytics. In our sandbox solution, we use Elasticsearch, which is built atop Apache Lucene, a full-text search engine library.
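A concrete way to picture an event in text form is as a serialized record ready for indexing. The sketch below is hypothetical: the field names and the actor path are illustrative, not Lightbend Monitoring's actual event schema:

```python
import json
import time

# A hypothetical event record in the spirit of a "dead letter" event:
# a historical record that something happened, serialized as text so
# a full-text engine like Elasticsearch can index and search it.
event = {
    "type": "dead-letter",
    "actor": "/user/orders/worker-7",     # illustrative actor path
    "message_class": "OrderConfirmed",    # illustrative message type
    "timestamp": time.time(),
}
document = json.dumps(event)  # text form, ready to index
```

Because each event is a self-describing text document, queries like "all dead-letter events for this actor in the last hour" become simple indexed searches.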
There are two camps when it comes to collection, just as there are two ways to build an organizational structure: centralized and decentralized. Both models have pros and cons, a discussion of which is beyond the scope of this document. In our case, we have chosen a decentralized, agent-based method for collecting data, with a local agent running on each host that instruments a given application or service. Upon collection, the agent performs some aggregation and then pushes the results to the intended recipient.
Part of “collection” is the means or direction in which the data flows for both aggregation and reporting, of which there are two styles: pull and push. Sometimes referred to as “blackbox” and “whitebox” monitoring, pull vs. push can also be a hot topic for debate.
Pull/polling-based solutions are quite common in monitoring and historically favor a centralized organizational structure, although this is not a requirement. In pull-based monitoring, the system asks or queries a monitored component, for example by pinging a host, and usually emphasizes availability as the primary concern. One drawback to this approach is that the more applications and systems being monitored, the more explicit checks need to be configured.
Push-based monitoring works differently in that the application being monitored becomes an emitter of data, “pushing” metrics and events at some time interval or upon a constraint violation. This approach typically favors a decentralized structure and has the advantage of not requiring the monitoring system to “pre-register” the monitored component, as the emitter pushes data to the configured destination upon start. Another advantage of push-based monitoring is that the communication channel is unidirectional: the emitters do not listen for remote connections, which in turn reduces the complexity of the security model for the network.
Like most things, push-based monitoring has its drawbacks. The main disadvantage is related to the volume of data being “pushed”: it must be tuned so as not to overwhelm the network by pushing too much data.
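The push style described above can be sketched as a fire-and-forget UDP emitter. This is a minimal illustration, not a production agent; the host, port, and metric name are assumptions, and a real emitter would run the push on a background thread at a configured interval:

```python
import json
import socket
import time

def build_payload(name, value, timestamp):
    """Serialize one metric observation as a text payload."""
    return json.dumps({"metric": name, "value": value, "timestamp": timestamp})

def push(sock, host, port, payload):
    # UDP send is fire-and-forget: the emitter never listens for
    # inbound connections, keeping the channel unidirectional.
    sock.sendto(payload.encode(), (host, port))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = build_payload("requests.count", 42, time.time())
push(sock, "127.0.0.1", 8125, payload)  # one push; a real emitter loops on an interval
```

Note that the emitter needs no knowledge of who, if anyone, is listening, which is exactly why push-based components require no pre-registration with the monitoring system.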
Logging is pretty much a staple of any monitoring system. Some consider logging to be a subset of events and in many ways that definition fits. In most cases, logs end up being used for auditing and, from a monitoring perspective, fault diagnosis.
When considering a logging system, there are several things to keep in mind. First, you’ll want a solution that is scalable and capable of processing large amounts of data for reasonable retention times. Second, logging should be lightweight so that its overhead is minimal. The toolset that is part of the logging framework should provide the ability to parse and manipulate log files, and it should also integrate with the other parts of your monitoring system, specifically events.
A time-series data structure is a temporal list of related data points. This type of structure is quite common for monitoring, and the points are usually taken in a sequence at equally spaced intervals in time. Examples of time series data are meter readings for a building automation system, heart rate monitoring for exercise, and temperature readings for combustion engine telemetry.
In addition to the standard time series structure, there is also dimensional, or bi-temporal, time series data. Dimensional time series data records multiple readings for the same data point as more accurate readings appear. The accuracy of the data can be affected by latency, where one uses a predicted (time-series-forecasted) value until the real value arrives.
System-wide aggregation of data should reveal the overall health and well-being of a system. This type of aggregation is designed to act as roll-up and transformation of data to give various views and insight into the system’s state. In addition, aggregation can be used to play “what-if” scenarios to explore the potential side effects of modifying or introducing new behavior.
Aggregation of data provides context around different data points. Typically, aggregation occurs at a centralized location and can provide powerful insight into the behavior of an application or system, especially when dynamic querying is available. The problem becomes latency when one needs near real-time or real-time results and the cost of waiting for remote aggregation is too high. To solve this potential problem with lag, we employ local aggregation.
Local aggregation, in particular for monitoring, is an optimization and follows the idea of keeping your data near. Many metrics and events are part of a larger aggregated view that can be computed locally on the fly rather than sending the raw data to another location for analysis. You can, and often do, send the data to a remote aggregator, but in many cases this becomes additional overhead that is not needed.
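Local aggregation can be as simple as rolling raw samples up into a small per-interval summary before anything leaves the host. The latency samples and field names below are illustrative:

```python
def summarize(samples):
    """Roll raw samples up into a compact per-interval summary."""
    return {
        "count": len(samples),
        "sum": sum(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": sum(samples) / len(samples),
    }

# Raw latency samples collected locally during one interval.
latencies_ms = [12.0, 15.0, 11.0, 40.0, 13.0]

# Five summary fields are shipped instead of every raw sample.
summary = summarize(latencies_ms)
```

The trade-off is the usual one for aggregation: the summary is cheap to ship and query, but fine-grained detail (e.g., the exact order of the samples) is no longer available remotely.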
Storage represents the persistence layer of a monitoring architecture and comes in three temperatures: hot, warm, and cold.
Hot storage, or hot data as it’s commonly referred to, is data that is frequently accessed and stored in memory or on a flash drive in hybrid storage environments. Warm storage or warm data is data that is accessed less frequently and stored on slower storage mechanisms such as magnetic disk. Cold storage is reserved for data that is rarely accessed or archived. This notion of multi-temperature data management is common in big or fast data architectures as well as monitoring.
Analytics, as defined by Wikipedia, is “the discovery, interpretation, and communication of meaningful patterns in data.” From a monitoring perspective, these “meaningful patterns in data” provide the necessary insight into a system or application that allows us to determine and quantify its performance. Analytics can also establish a baseline for optimization and control of a system by way of historical analysis. Let’s take a look at some of the common concepts found in analytics related to monitoring.
Fast data is a relatively new term when it comes to analytics. Historically, we think of big data or data warehousing, where data is captured on distributed file systems and then processed in batches. While this approach has been the go-to strategy for analytics, it increasingly puts one at a competitive disadvantage due to latency. As a result, analytics is beginning to adopt a new approach based on streaming, where data is processed as it arrives.
In an ideal system, both methods, big data and fast data, are used. For more information on this topic, see Fast Data Architectures for Streaming Applications by Lightbend’s fast data expert, Dean Wampler, PhD.
Real-time data refers to applications or systems that are subject to constraints where the application guarantees a response within a specified time limit. A “real-time” response is usually understood to be within milliseconds, or even microseconds, especially for automated systems that implement control.
From a monitoring perspective, we adopt the term “near real-time” when monitoring distributed systems due to the inherent latency involved. Near real-time (NRT) data accounts for the time delay by introducing an acceptable calculated or smoothed latency metric.
Another reason why NRT is acceptable is the human factor. Until we have a fully automated monitoring system (one that will act upon the data to fix underlying problems), there will be people involved in the process. Therefore, a sub-second delay is perfectly fine because we humans operate on a second or even minute basis.
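One common way to compute the "smoothed latency metric" mentioned above is an exponentially weighted moving average, which damps out momentary spikes in observed collection delay. The choice of `alpha` and the sample values below are illustrative assumptions:

```python
def ewma(samples, alpha=0.5):
    """Exponentially weighted moving average over a series of samples."""
    smoothed = samples[0]
    for x in samples[1:]:
        # Each new sample is blended with the running average, so a
        # single outlier only partially moves the smoothed value.
        smoothed = alpha * x + (1 - alpha) * smoothed
    return smoothed

delays_ms = [200.0, 220.0, 900.0, 210.0]  # one transient spike
print(ewma(delays_ms))
```

A lower `alpha` smooths more aggressively (slower to react), while a higher `alpha` tracks the raw samples more closely; the right balance depends on how quickly the monitoring system needs to notice genuine latency shifts.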
One of the problems with constraints is that they usually do not take context or historical patterns into consideration. For example, if you set a restriction on an actor mailbox of no more than fifty messages per second and notify someone when the actor exceeds this count, the alert may or may not represent a legitimate concern.
A better approach is anomaly detection, where through events and observations you establish a pattern by which the actor behaves. Then, as the actor performs, its behavior is matched against the established pattern. In the mailbox example above, we might determine an average mailbox throughput of thirty to sixty messages per second. If we suddenly spike to one hundred messages per second, this could be an anomaly, but again it has to be considered in the context of the established behavior.
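A simple sketch of this idea flags an observation only when it deviates strongly from the established baseline, here measured as a z-score against historical throughput. The baseline values and the threshold of three standard deviations are illustrative choices, not a prescribed policy:

```python
import statistics

def is_anomaly(history, observed, threshold=3.0):
    """Flag an observation that deviates strongly from its baseline."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    z = abs(observed - mean) / stdev  # distance in standard deviations
    return z > threshold

# Established mailbox throughput in messages/sec (illustrative history).
baseline = [30, 45, 50, 38, 60, 42, 55, 35, 48, 52]

print(is_anomaly(baseline, 58))   # within normal variation
print(is_anomaly(baseline, 100))  # far outside the established pattern
```

Unlike the fixed fifty-messages-per-second rule, this check adapts as the baseline evolves: a value that would trip a static threshold may be perfectly normal for an actor whose typical variation is wide.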
Notifications and visualizations are two of the primary outputs of a monitoring system. Notifications often come in the form of emails, pop-ups, SMS messages, and other related forms of communication to let you know that something may be awry with your system or application.
Notifications may seem like a simple part of monitoring, but they can be rather complex. The primary reason for the complexity is that it is crucial to keep in mind the context in which the notification or alert triggers. Failing to bind notifications to a context results in a poorly managed monitoring system, which often causes “alert fatigue.” Following is a list of contextual concerns regarding notifications and alerts worth considering when implementing any monitoring solution:
- Who do you tell about the problem?
- In what way do you tell them?
- At what frequency do you tell them?
- At what point do you stop telling them?
- At what point do you escalate?
As mentioned above, the result of not considering context when implementing notifications is “alert fatigue.” If you generate too many notifications or create notifications based on some threshold that does not represent a real problem, users will begin to ignore the communication.
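One common fatigue-control technique, addressing the "at what frequency" and "at what point do you stop" questions above, is to suppress repeat alerts for the same issue within a cool-down window. The sketch below is illustrative; the window length and alert key format are assumptions:

```python
import time

class AlertSuppressor:
    """Suppress repeat alerts for the same issue within a cool-down window."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown = cooldown_seconds
        self.last_sent = {}  # alert key -> time it was last sent

    def should_send(self, alert_key, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cool-down window: stay quiet
        self.last_sent[alert_key] = now
        return True

s = AlertSuppressor(cooldown_seconds=300)
print(s.should_send("disk-full:host-1", now=0))    # first alert goes out
print(s.should_send("disk-full:host-1", now=60))   # repeat is suppressed
print(s.should_send("disk-full:host-1", now=400))  # window elapsed, alert again
```

Suppression is only one piece of the answer; escalation policy and routing (who gets told, and how) still need to be layered on top, but even this small mechanism keeps a flapping check from paging the same person repeatedly.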
It has been said that a picture is worth a thousand words, and the same can be said for monitoring and visualizations. Visualizations in the form of tables, charts, and graphs provide a monitoring system with a powerful analytic tool for reasoning about the state and performance of a system.
Visualizations, like notifications, are not a simple domain. As consumers of visualizations, users can sometimes suffer from apophenia, a term coined by Klaus Conrad in his monograph “The onset of schizophrenia: an attempt to form an analysis of delusion.” Apophenia is the tendency to perceive meaningful patterns in random data, which in turn can lead to sudden jumps from association to causation. Following is a list of key concerns one should consider when implementing visualizations:
- Data must be clearly shown
- Graphs and charts should cause the viewer to think about substance, not visuals
- Smoothing should not distort the data
- Data sets that include lots of data should be coherent
- Changing granularity should not impact comprehension
Continuous delivery is the notion of continuously delivering your code to QA or production with the ability to react in real time to the results of the release, which is markedly different from the traditional model of planning, coding, and releasing. Most of us are used to the traditional release cycle but struggle with its challenges. In the traditional release model, we encounter three core problems: planning is hard, our priorities are ever-changing, and unforeseen problems often arise. Continuous delivery, on the other hand, provides a fresh way to view our release schedule, with some distinct benefits:
- Productivity improvements
- Frequent releases
- Improved issue resolution
- Faster feedback loop
- Better customer experience
- Reduced stress on the team
The main hurdle to continuous delivery is the implementation of a feedback loop that allows the team to react in real time. The key to solving this problem is monitoring. In a continuous delivery environment, it is essential that the delivery endpoint, be it QA or production, is observed, providing feedback in the form of notifications and alerts with critical visualizations in place. Without this tight integration with monitoring, continuous delivery is not achievable.