Streaming data pipelines simplified

Streaming data, or data in motion, is much more interesting than data at rest.

Developing, orchestrating, and operating distributed streaming applications on Kubernetes (K8s) is a difficult proposition. Tools from Lightbend, such as Akka Data Pipelines, can accelerate development and reduce risk. Akka Data Pipelines simplifies the complexities of streaming data pipelines, making them faster and easier to build, test, deploy, and run.

This guide covers the following topics:

  • What Akka Data Pipelines is and why you should care.

  • How to set up a local K8s test environment using MicroK8s.

  • How to install Akka Data Pipelines.

  • How to test and run a working sample windmill sensor data application.

    This sample demonstrates an Internet of Things (IoT) Proof of Concept (PoC), built with either the Java or Scala programming language and the open source Akka Streams toolkit.

What is Akka Data Pipelines

The Cloudflow open source project enables you to quickly develop, orchestrate, and operate distributed streaming applications on Kubernetes. Akka Data Pipelines adds Lightbend Console and Telemetry to the capabilities of open source Cloudflow:

  • Lightbend Console enables you to observe and monitor Lightbend Platform applications running on Kubernetes.

  • Telemetry provides the needed visibility into the health and performance of your production streaming services.

Akka Data Pipelines also includes the benefits associated with a Lightbend Subscription, such as training and support.

Building Streaming Data Pipelines with Cloudflow

Cloudflow helps you quickly develop distributed streaming applications on K8s by eliminating the boilerplate code and the associated YAML files needed for deployment and configuration. This massively reduces the time required to create, package, and deploy streaming applications, from weeks to hours.

Cloudflow streaming applications are built from small, composable components called streamlets. Streamlets connect to one another through strongly typed, schema-based contracts in a “Schema First” approach. To create a streaming data flow, you simply supply schemas, in either Avro or Protobuf format, that represent the domain model. In this guide, we’ll use Protobuf for the example.
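
As a rough illustration of the schema-first idea, the snippet below sketches what the domain types might look like on the Scala side after code generation from the Protobuf definitions (in a real project these classes are generated by the build, for example via ScalaPB; the type and field names here are illustrative assumptions, not the actual schemas in the sample application):

// Hypothetical stand-ins for the Protobuf-generated domain types.
// In a real project these classes are generated from .proto files,
// not written by hand; the fields shown are illustrative only.
case class SensorData(deviceId: String, timestamp: Long, windSpeed: Double)
case class Metric(deviceId: String, timestamp: Long, name: String, value: Double)
case class InvalidMetric(metric: Option[Metric], error: String)

Streamlets then declare their inlets and outlets in terms of these types, which is what lets Cloudflow verify that connected streamlets agree on the data flowing between them.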

You create streaming flows by assembling streamlets and connecting their schemas in a single deployable blueprint. The blueprint declares which streamlets join together to form the pipeline. For example, the IoT PoC blueprint looks like this:

blueprint {
  streamlets {
    grpc-ingress = sensordata.SensorDataGrpcIngress
    metrics = sensordata.SensorDataToMetrics
    validation = sensordata.MetricsValidation
    valid-logger = sensordata.ValidMetricLogger
    invalid-logger = sensordata.InvalidMetricLogger
  }

  topics {
    sensor-data {
      producers = [grpc-ingress.out]
      consumers = [metrics.in]
    }
    metrics {
      producers = [metrics.out]
      consumers = [validation.in]
    }
    invalid-metrics {
      producers = [validation.invalid]
      consumers = [invalid-logger.in]
    }
    valid-metrics {
      producers = [validation.valid]
      consumers = [valid-logger.in]
    }
  }
}

The visual model shown below illustrates the domain model and the connections declared in the blueprint. It depicts streamlets as processes and schemas as hexagons that are color-coded by type.

domain model

Notice the Apache Kafka topics inserted between each pair of connected streamlets. In this example, Kafka persists the elements of the data stream as they flow between streamlets. The topics add resiliency to the overall flow and allow streamlets to be scaled up or down independently. If a streamlet fails, it is restarted and picks up right where it left off, without losing any data.
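
To make that behavior concrete, here is a minimal sketch of what an Akka streamlet such as SensorDataToMetrics might look like in Scala. It is not the actual code from the sample application, and it assumes Cloudflow’s Scala API (AkkaStreamlet, ProtoInlet/ProtoOutlet, sourceWithCommittableContext, committableSink) together with the Protobuf-generated SensorData and Metric types sketched earlier. The key point is that each element is read along with its Kafka offset, and the offset is committed only after the result has been written downstream, which is what allows a restarted streamlet to resume where it left off:

import cloudflow.akkastream._
import cloudflow.akkastream.scaladsl._
import cloudflow.streamlets.StreamletShape
import cloudflow.streamlets.proto.{ ProtoInlet, ProtoOutlet }

// Sketch only: converts raw sensor readings into metrics.
class SensorDataToMetrics extends AkkaStreamlet {
  val in    = ProtoInlet[SensorData]("in")
  val out   = ProtoOutlet[Metric]("out", _.deviceId)   // partition outgoing records by device id
  val shape = StreamletShape(in, out)

  override def createLogic = new RunnableGraphStreamletLogic() {
    def runnableGraph =
      sourceWithCommittableContext(in)   // each element carries its Kafka offset as context
        .map(toMetric)                   // transform the payload; the offset travels along with it
        .to(committableSink(out))        // write to the outlet's topic, then commit the offset

    // Hypothetical conversion; the real sample emits one metric per measurement.
    def toMetric(data: SensorData): Metric =
      Metric(data.deviceId, data.timestamp, "windSpeed", data.windSpeed)
  }
}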

Streamlets can be created using the best streaming engine for the task at hand, including Akka Streams, Flink, or Spark. Akka and Flink streamlets can be developed in Scala or Java, while Spark streamlets can only be developed in Scala.

Similar to puzzle pieces, streamlets also have a shape. A streamlet’s shape is defined by its inlets and outlets and their associated schema types. Each streamlet may have zero or more inlets and outlets.

As shown in the example below, the Validation streamlet has one inlet of type Metrics and two outlets, one of type Invalid-Metrics and the other of type Valid-Metrics. This shape is also known as a fan-out.

validation streamlet
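
In code, that fan-out shape might be declared along the following lines. This is only a sketch: it assumes Cloudflow’s Scala API, uses Metric and InvalidMetric as stand-ins for the sample’s Protobuf-generated schema types, and leaves out the validation logic itself:

import cloudflow.akkastream._
import cloudflow.streamlets.StreamletShape
import cloudflow.streamlets.proto.{ ProtoInlet, ProtoOutlet }

// Sketch of a fan-out shape: one inlet, two outlets.
class MetricsValidation extends AkkaStreamlet {
  val in      = ProtoInlet[Metric]("in")
  val invalid = ProtoOutlet[InvalidMetric]("invalid", _.error)   // partition keys are illustrative
  val valid   = ProtoOutlet[Metric]("valid", _.deviceId)

  // The shape is built from the inlet plus both outlets.
  val shape = StreamletShape(in).withOutlets(invalid, valid)

  override def createLogic = ???   // validation logic omitted in this sketch
}

The same pattern extends to any number of inlets and outlets; a fan-in, for example, is simply the mirror image, with multiple inlets feeding a single outlet.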

As mentioned previously, streamlets aren’t required to have both inlets and outlets. Streamlets with only one or the other usually integrate with something outside of the application. For example, a common use case is an Akka streamlet with no inlet that pulls data from the outside world into the flow through one of the Alpakka streaming connectors. Another common use case is a streamlet with no outlet that persists data to a database or data lake at the end of a flow.
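
A sketch of the second case, an egress-style streamlet with an inlet but no outlet, might look like the following. Again this assumes Cloudflow’s Scala API, uses Metric as a stand-in for the generated schema type, assumes a committableSink variant that only commits offsets without producing to a topic, and simply logs each element where a real streamlet would write to a database or data lake:

import cloudflow.akkastream._
import cloudflow.akkastream.scaladsl._
import cloudflow.streamlets.StreamletShape
import cloudflow.streamlets.proto.ProtoInlet

// Sketch of an egress: one inlet, no outlets.
class ValidMetricLogger extends AkkaStreamlet {
  val in    = ProtoInlet[Metric]("in")
  val shape = StreamletShape.withInlets(in)

  override def createLogic = new RunnableGraphStreamletLogic() {
    def runnableGraph =
      sourceWithCommittableContext(in)
        .map { metric =>
          system.log.info(s"valid metric: $metric")   // stand-in for writing to an external system
          metric
        }
        .to(committableSink)   // nothing is produced downstream; offsets are still committed
  }
}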

The Akka Data Pipelines framework greatly simplifies the complexities of building, operating, and monitoring streaming data pipelines. Teams can focus on building business logic and the application of machine learning to actionable data instead of worrying about the underlying framework.

Next, we’ll take a look at running K8s locally on your machine and then running our sample Akka Data Pipelines IoT PoC application.