Aggregation

Apache Kafka

Apache Kafka is a fast, durable, and scalable publish-subscribe messaging bus in the form of a distributed commit log. Kafka is designed so that a single cluster can scale elastically and serve as a centralized hub for an organization of virtually any size, and it achieves durability by persisting all messages to “warm storage” (disk) so they survive a cluster outage. We should point out that Kafka is one of those tools that can be considered both a collector and a store. As a publish-subscribe messaging bus it acts more like a collector, but its durability gives it the features of a persistent store as well.
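
As a quick illustration of the publish-subscribe model, here is a minimal sketch using the third-party kafka-python client; the broker address and topic name are assumptions for the example:

    from kafka import KafkaProducer

    # Connect to the cluster; "localhost:9092" is an assumed broker address.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Messages are appended to a topic's commit log and persisted to disk,
    # so they remain available even if consumers restart or the cluster
    # recovers from an outage.
    producer.send("metrics", b"cpu.load:0.75")
    producer.flush()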

While we don’t have direct support for Kafka, you can “plug” it into our monitoring solution using the Etsy StatsD backend via a REST proxy endpoint.
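
To make that concrete, a hypothetical StatsD backend could forward flushed metrics to Kafka with an HTTP call like the one below. The proxy URL, topic name, and payload shape follow Confluent REST Proxy (v2 API) conventions and are assumptions for this sketch:

    import requests

    # Publish a flushed metric to a Kafka topic through a REST proxy.
    # "localhost:8082" and the "metrics" topic are assumed values.
    resp = requests.post(
        "http://localhost:8082/topics/metrics",
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        json={"records": [{"value": {"metric": "web.requests", "count": 42}}]},
    )
    resp.raise_for_status()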

Apache Flume

Apache Flume, like Kafka, is both a collector and a storage mechanism based on streaming data flows. The Apache Flume documentation describes it as a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

Flume is built around event consumption: a Flume source consumes events from an external origin, for example, a web server. The external origin sends events to Flume in a format the target source recognizes, and the source stores each event in a channel. A channel is a passive store that persists the event until a sink consumes it. The sink then removes the event from the channel and either writes it to a repository like HDFS or forwards it to the source of the next Flume agent in the flow.
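
As a sketch of how an external origin hands events to Flume, the following assumes an agent configured with an HTTP source listening on port 44444 (the host and port are assumptions); the JSON shape, a list of events with headers and a string body, matches Flume’s default JSONHandler for the HTTP source:

    import requests

    # Send one event to a Flume agent's HTTP source. From here the source
    # places it in a channel, where it waits until a sink consumes it.
    event = [{
        "headers": {"host": "web01", "facility": "access_log"},
        "body": "GET /index.html 200 0.042",
    }]
    requests.post("http://localhost:44444", json=event).raise_for_status()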

StatsD

StatsD provides configurable aggregation, with settings for downsampling, flush interval, pattern matching, and retention time.
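
Clients simply emit raw values and let StatsD do the rolling up: everything received within a flush interval (10 seconds by default) is aggregated into a single value per metric. The snippet below sends a counter and a timer over UDP using the standard "<name>:<value>|<type>" wire format; the host and default port 8125 are assumed:

    import socket

    # StatsD listens on UDP port 8125 by default. Counters (|c) are summed
    # and timers (|ms) are summarized per flush interval before being
    # forwarded to the configured backend.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"web.requests:1|c", ("localhost", 8125))           # counter
    sock.sendto(b"web.response_time:212|ms", ("localhost", 8125))   # timer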

CollectD

CollectD likewise supports aggregation through its plugin architecture. The Aggregation plugin consolidates values using functions such as sum and average.
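
The following toy sketch, written in plain Python rather than CollectD’s own configuration syntax, is only meant to illustrate the kind of consolidation the plugin performs; the hosts and readings are made up:

    from collections import defaultdict

    # Group per-core CPU readings by host, then emit a sum and an average
    # for each group, analogous to the plugin's CalculateSum and
    # CalculateAverage consolidation functions.
    readings = [
        ("web01", "cpu-0", 12.0), ("web01", "cpu-1", 18.0),
        ("web02", "cpu-0", 40.0), ("web02", "cpu-1", 44.0),
    ]

    groups = defaultdict(list)
    for host, _core, value in readings:
        groups[host].append(value)

    for host, values in groups.items():
        print(host, "sum:", sum(values), "average:", sum(values) / len(values))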