From batch to streaming - the power of Cloudflow

The modern technological ecosystem gives businesses access to more data than ever before. The challenge becomes handling it and making sense of it in time to act strategically. Sticking with a traditional strategy of storing data and processing it in batches, often hours later or overnight, can result in lost opportunities.

Some example applications that motivate use of a streaming architecture include:

  • Handling and analyzing IoT data in real time or near real time: Thousands, or even millions, of data points have no value unless you can process them quickly and act accordingly. For example, you can avoid expensive downtime by monitoring device performance and scheduling maintenance or replacement when behavior degrades.

  • Analyzing customer behavior: Anomalies in credit card purchases that could indicate fraud require immediate action to prevent costly issues for both the card issuer and the customer. Similarly, quickly detecting changes in customer behavior gives businesses the ability to offer new products and services when they are most likely to be accepted.

  • Enhancing the user experience with machine learning: Customers exhibit patterns when buying books, movies, or support packages, or when seeking expert advice. Recommendation engines tap into those patterns to improve the customer experience and increase revenue and loyalty.

As discussed in our ebook, Fast Data Architectures For Streaming Applications, streaming data applications have the same scalability and resiliency requirements as reactive microservices. Extracting business value from large volumes of data in real time requires integrating existing data sources with streaming technologies and with the parts of your ecosystem that consume the resulting analytics. Without in-house expertise, building such systems is difficult and costly.

A variety of components for managing data in-flight have emerged to meet these challenges. They enable streaming applications that support timely analysis, including machine learning and artificial intelligence. However, they can be hard to orchestrate, so Lightbend offers Cloudflow to address these challenges.

Cloudflow ties all of your streaming components together, allowing you to easily define, deploy, and operate multi-stage, multi-component flows of streaming data. Cloudflow eliminates the need for developers to write boilerplate code and provides operational tooling to improve developer productivity and automate essential operations.

Cloudflow, backed by Lightbend experience, simplifies design and deployment of streaming applications that use:

  • Apache Kafka: Serves as the messaging backbone for data streaming between services. It provides persistence and resiliency for data as it flows through your system.

  • Apache Spark: The industry standard for continuous processing of large data sets with its streaming engine.

  • Apache Flink: A framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It runs in all common cluster environments and performs computations at in-memory speed, at any scale.

  • Akka Streams: Enables reactive stream processing; see the short sketch after this list. When combined with Alpakka connectors, Akka Streams provides robust integration with external data sources for fast and efficient data ingestion. Data streamed into Apache Kafka can be further transformed using Apache Kafka Streams, and once data has been streamed into Cloudflow, it can flow through Kafka for further processing or be sent to Akka Streams and Apache Spark for aggregations and other types of complex processing.

  • Integrations with machine learning systems: It’s one thing to train machine learning models. It’s quite another to deploy and manage them successfully in streaming applications. Cloudflow provides tools and expertise for serving machine learning models that leverage TensorFlow, Kubeflow, and other tools.

  • Lightbend Console: Provides a window into the streaming components and their runtime activities, making it simpler to manage and debug complex streaming applications.

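To ground the Akka Streams bullet above, here is a minimal, self-contained pipeline in Scala. This is a generic illustration of reactive stream processing with backpressure, not Cloudflow-specific code; the element values and the transformation are invented for the example, and real ingestion would typically use an Alpakka connector as the Source.

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}

    object StreamQuickstart extends App {
      // The ActorSystem provides the runtime that materializes the stream.
      implicit val system: ActorSystem = ActorSystem("quickstart")
      import system.dispatcher

      Source(1 to 10)                       // a bounded source, for illustration
        .map(n => s"record-$n")             // transform each element
        .runWith(Sink.foreach(println))     // materialize and consume the stream
        .onComplete(_ => system.terminate())
    }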

You can deploy these streaming systems to Kubernetes®-based platforms in the cloud, on premises, or in hybrid environments.

It’s worth migrating many high-value batch applications to streaming so that useful information can be extracted sooner rather than later, even though many batch and other "offline" analytics, like data warehousing and machine learning model training, will remain essential for a complete environment.

Lightbend sponsors an open source version of Cloudflow. A Lightbend subscription gives you access to Cloudflow enterprise features, which make building, deploying, and managing streaming applications as painless as possible. It also provides observability for your Cloudflow applications and enterprise-level support.

When you’re building streaming applications, Cloudflow emphasizes how they are structured conceptually and lets you work at that level of abstraction. It also allows you to visualize and observe your application, as shown in the actual screen shot below. From the top left, clockwise, the components of the UI include:

  • Controls: Icons to navigate to the GRAFANA dashboard or the infrastructure WORKLOADS view.

  • Blueprint graph pane: Gives you an overview of how the Cloudflow application is wired together.

  • VIEW: A selector that allows you to choose a time period.

  • Application Details: Rolls up health status and events.

  • Selection: A pane that identifies the current selection and lets you choose between viewing Health and streamlet Shape.

  • Metrics: Graphs of Consumer lag and Throughput records, along with other key metrics.

Cloudflow Runtime User Interface

When you create a Cloudflow application, you think in terms of a blueprint that defines how streamlets connect, as shown in the blueprint graph pane above. A streamlet is an encapsulated unit of business functionality that you implement to manipulate stream data.
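
To make the streamlet concept concrete, here is a minimal sketch of an Akka-based streamlet in Scala. It assumes the open source Cloudflow Scala API (AkkaStreamlet, AvroInlet/AvroOutlet, StreamletShape, RunnableGraphStreamletLogic); exact names can vary between Cloudflow versions, and CallRecord stands in for a class generated from the Avro schemas in your project.

    import cloudflow.akkastream._
    import cloudflow.akkastream.scaladsl._
    import cloudflow.streamlets.StreamletShape
    import cloudflow.streamlets.avro.{AvroInlet, AvroOutlet}

    // CallRecord is assumed to be generated from an Avro schema definition.
    class CallRecordValidation extends AkkaStreamlet {

      // One inlet and one outlet; the outlet partitions records by user.
      val in  = AvroInlet[CallRecord]("in")
      val out = AvroOutlet[CallRecord]("out", _.user)

      // The shape declares how this streamlet can be wired in a blueprint.
      override val shape = StreamletShape(in, out)

      override def createLogic = new RunnableGraphStreamletLogic() {
        // Drop records with negative durations; offsets are committed after
        // records are written downstream (at-least-once semantics).
        def runnableGraph =
          sourceWithCommittableContext(in)
            .filter(_.duration >= 0)
            .to(committableSink(out))
      }
    }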

The screen shot above and the illustration below reflect one of the Cloudflow example applications, which simulates processing of call detail records (CDRs) used in telecom systems. Here’s a wire diagram of this blueprint with a little more detail:

Cloudflow Application Blueprint

Working from the left, the three streamlets each ingest a source of CDRs (simulated in the example app). Their output is merged in the next streamlet, then sent to a streamlet that parses, validates, and transforms the records. Errors from the parsing step are logged, while good records are sent to an aggregation streamlet that calculates various statistics over the moving data and finally sends the results downstream.
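
For illustration, a blueprint for a pipeline shaped like this one might look roughly as follows. The streamlet and class names here are hypothetical, not those of the actual example app, and the blueprint file schema differs between Cloudflow versions (newer releases declare named topics rather than direct connections):

    blueprint {
      streamlets {
        # Logical streamlet names mapped to implementation classes.
        ingress-1   = cdr.CallRecordIngress
        ingress-2   = cdr.CallRecordIngress
        ingress-3   = cdr.CallRecordIngress
        merge       = cdr.CallRecordMerge
        validation  = cdr.CallRecordValidation
        aggregation = cdr.CallStatsAggregator
      }
      connections {
        # Outlets wired to inlets; Cloudflow checks that the Avro schemas
        # on both ends of each connection are compatible at build time.
        ingress-1.out  = [merge.in-0]
        ingress-2.out  = [merge.in-1]
        ingress-3.out  = [merge.in-2]
        merge.out      = [validation.in]
        validation.out = [aggregation.in]
      }
    }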

In the example code base, the aggregation streamlet is written in Spark Structured Streaming, while the others are written in Akka Streams. When you build the application, Cloudflow verifies that the schemas on each end of the lines are compatible, and it instantiates savepoints, the connections shown as lines between the streamlets. (These are actually automatically generated Kafka topics, but that’s an implementation detail that could change.)
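
For comparison with the Akka-based sketch earlier, here is a rough sketch of what a Spark-based aggregation streamlet can look like. It assumes Cloudflow’s documented Spark support (SparkStreamlet, SparkStreamletLogic, readStream, writeStream); CallRecord and CallStats again stand in for Avro-generated classes, and the aggregation is deliberately simplified:

    import cloudflow.spark.{SparkStreamlet, SparkStreamletLogic}
    import cloudflow.spark.sql.SQLImplicits._
    import cloudflow.streamlets.StreamletShape
    import cloudflow.streamlets.avro.{AvroInlet, AvroOutlet}
    import org.apache.spark.sql.functions.sum
    import org.apache.spark.sql.streaming.OutputMode

    // CallRecord and CallStats are assumed to be Avro-generated classes.
    class CallStatsAggregator extends SparkStreamlet {

      val in  = AvroInlet[CallRecord]("in")
      val out = AvroOutlet[CallStats]("out", _.user)

      override val shape = StreamletShape(in, out)

      override def createLogic = new SparkStreamletLogic {
        override def buildStreamingQueries = {
          // Continuously aggregate total call duration per user and emit
          // updated rows downstream whenever an aggregate changes.
          val stats = readStream(in)
            .groupBy($"user")
            .agg(sum($"duration").as("totalDuration"))
            .as[CallStats]
          writeStream(stats, out, OutputMode.Update).toQueryExecution
        }
      }
    }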

We recommend our ebook, Fast Data Architectures For Streaming Applications. It outlines some of the difficult decisions involved in choosing components to integrate when building streaming applications. Cloudflow reflects our opinionated view about the best way to solve these challenges, relieving you of much of this hard work.

See the open source Cloudflow documentation and Cloudflow enterprise features documentation for more details on using Cloudflow, including expanded explanations of the concepts briefly mentioned here, like blueprints and streamlets.

To use the enterprise features of Cloudflow and take advantage of Lightbend subscription support, you must use the installer documented in the Cloudflow enterprise features guide.