From batch to streaming - the power of Akka Data Pipelines

The modern technological ecosystem gives businesses access to more data than ever before. The challenge becomes handling it and making sense of it in time to act strategically. Sticking with a traditional strategy of storing data and processing it in batches, often hours later or overnight, can result in lost opportunities.

Some example applications that motivate use of a streaming architecture include:

  • Handling and analyzing IoT data in real-time or near real-time: Thousands, or even millions, of data points have no value unless you can process them quickly and act accordingly. For example, you can avoid expensive downtime by monitoring device performance and scheduling maintenance or replacement when behavior degrades.

  • Analyzing customer behavior: Anomalies in credit card purchases that could indicate fraud require immediate action to prevent costly issues for both the card issuer and the customer. Similarly, quickly detecting changes in customer behavior gives businesses the ability to offer new products and services when they are most likely to be accepted.

  • Enhancing the user experience with machine learning: Customers exhibit patterns when buying books, movies, or support packages, or when seeking expert advice. Recommendation engines tap into those patterns to boost the customer experience and increase revenue and loyalty.

As discussed in our ebook, Fast Data Architectures For Streaming Applications, streaming data applications have the same scalability and resiliency requirements as Reactive Microservices. Extracting business value from large volumes of data in real-time requires integrating existing data sources with streaming technologies and with the other parts of your ecosystem that consume analytics. Without in-house expertise, it is difficult and costly to create such systems.

A variety of components for managing data in-flight have emerged to meet these challenges. They enable streaming applications that support timely analysis, including machine learning and artificial intelligence. However, they can be hard to orchestrate, so Lightbend offers Akka Data Pipelines to address these challenges.

Akka Data Pipelines ties all of your streaming components together, allowing you to easily define, deploy, and operate multi-stage, multi-component flows of streaming data. This eliminates the need for developers to write boilerplate code and provides operational tooling to improve developer productivity and automate essential operations.

Akka Data Pipelines, backed by Lightbend experience, simplifies design and deployment of streaming applications that use:

  • Apache Kafka: Serves as the messaging backbone for data streaming between services. It provides persistence and resiliency for data as it flows through your system.

  • Apache Spark: The industry standard for continuous processing of large data sets with a streaming engine.

  • Akka Streams: Enables reactive stream processing. When used with Alpakka connectors, Akka Streams provides robust integration with external data sources for fast and efficient data ingestion. Data streamed into Apache Kafka can be further transformed using Apache Kafka Streams. Once data has been streamed into your application, you can use Kafka for further processing or send data to Akka Streams and Apache Spark for aggregations and other types of complex processing.

  • Integrations with machine learning systems: It’s one thing to train machine learning models. It’s quite another thing to successfully deploy and manage them in streaming applications. Akka Data Pipelines provides tools and expertise for serving machine learning models that leverage TensorFlow, Kubeflow, and other tools.

  • Lightbend Telemetry used with Prometheus and Grafana: Provides a window into streaming components and their runtime activities, making it simpler to manage and debug complex streaming applications.

You can deploy these streaming systems to Kubernetes®-based platforms in the cloud, on-prem, or in hybrid environments.

It’s worth migrating many high-value batch applications to streaming to extract useful information sooner rather than later. Even so, batch and other "offline" analytics, like data warehousing and machine learning model training, will remain essential for a complete environment.

When you’re building streaming applications, Akka Data Pipelines lets you focus on how they are structured conceptually and then actually work at that level of abstraction. It also allows you to visualize and observe your app. When you create an Akka Data Pipelines application, you think in terms of a blueprint that defines how streamlets connect. A streamlet is an encapsulated unit of business functionality that you implement to manipulate stream data. The following shows the blueprint and streamlets for an example application that simulates processing of call detail records (CDRs) used in Telecom systems.

Application Blueprint

Working from the left, the three streamlets each ingest a source of CDRs (simulated in the example app). Their output is merged in the next streamlet and then sent to a streamlet that parses, validates, and transforms the records. Errors from the parsing step are logged, while good records are sent to an aggregation streamlet that calculates various statistics over the moving data and finally sends the results downstream.
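The flow described above can be sketched in plain Python to make the stages concrete. This is only an illustration under stated assumptions, not the Akka Data Pipelines API: the record format (`user,duration_seconds`), the function names, and the per-user total used as the statistic are all made up for the example, and the sources are finite lists rather than unbounded streams.

```python
from collections import defaultdict
from itertools import chain


def merge(*sources):
    """Merge stage: combine several CDR sources into one stream."""
    return chain(*sources)


def parse(line):
    """Parse and validate one raw record of the assumed form 'user,duration'."""
    user, duration = line.split(",")   # raises ValueError if the shape is wrong
    duration = int(duration)           # raises ValueError if not a number
    if not user or duration < 0:
        raise ValueError(f"invalid record: {line!r}")
    return user, duration


def run_pipeline(*sources):
    """Collect bad records as errors and aggregate good ones per user."""
    errors = []
    totals = defaultdict(int)
    for line in merge(*sources):
        try:
            user, duration = parse(line)
        except ValueError as exc:
            errors.append(str(exc))    # stands in for the error-logging streamlet
            continue
        totals[user] += duration       # stands in for the aggregation streamlet
    return dict(totals), errors


# Three simulated CDR sources, two of which contain a bad record.
totals, errors = run_pipeline(
    ["alice,30", "bob,12"],
    ["alice,5", "garbage"],
    ["carol,-1"],
)
```

In a real streaming application each stage would run continuously and independently, and the aggregation would typically be computed over time windows rather than over the whole input.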

In the example code base, the aggregation streamlet is written in Spark Structured Streaming, while the others are written in Akka Streams. When you build the application, Akka Data Pipelines verifies that the schemas on each end of the lines are compatible, and it instantiates savepoints, the connections shown as lines between the streamlets. (These are actually automatically generated Kafka topics, but that’s an implementation detail that could change.)
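To illustrate the kind of compatibility check described above, here is a minimal Python sketch that verifies connected streamlets agree on a schema. The `Streamlet` class, the schema names, and `verify_blueprint` are hypothetical and exist only for this example. Akka Data Pipelines performs its own verification at build time using its own blueprint format, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field


@dataclass
class Streamlet:
    """Hypothetical streamlet description: named ports mapped to schema names."""
    name: str
    inlets: dict = field(default_factory=dict)    # inlet name -> schema name
    outlets: dict = field(default_factory=dict)   # outlet name -> schema name


def verify_blueprint(streamlets, connections):
    """Return a list of problems; an empty list means all connections match.

    connections is a list of ((source, outlet), (target, inlet)) pairs.
    """
    by_name = {s.name: s for s in streamlets}
    problems = []
    for (src, outlet), (dst, inlet) in connections:
        out_schema = by_name[src].outlets.get(outlet)
        in_schema = by_name[dst].inlets.get(inlet)
        if out_schema is None or in_schema is None:
            problems.append(f"{src}.{outlet} -> {dst}.{inlet}: unknown port")
        elif out_schema != in_schema:
            problems.append(
                f"{src}.{outlet} -> {dst}.{inlet}: "
                f"{out_schema} is not compatible with {in_schema}"
            )
    return problems


streamlets = [
    Streamlet("ingress", outlets={"out": "RawCdr"}),
    Streamlet("parser", inlets={"in": "RawCdr"},
              outlets={"valid": "Cdr", "errors": "ParseError"}),
    Streamlet("aggregator", inlets={"in": "Cdr"}),
    Streamlet("error-logger", inlets={"in": "Cdr"}),  # deliberately wrong schema
]

connections = [
    (("ingress", "out"), ("parser", "in")),
    (("parser", "valid"), ("aggregator", "in")),
    (("parser", "errors"), ("error-logger", "in")),
]

problems = verify_blueprint(streamlets, connections)
```

Here `problems` contains a single entry for the `parser.errors` connection, because the logger was declared with the wrong inlet schema. Catching this kind of mismatch before deployment is the point of verifying the blueprint at build time.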

We recommend our ebook, Fast Data Architectures For Streaming Applications. It outlines some of the difficult decisions involved in choosing components to integrate when building streaming applications. Akka Data Pipelines reflects our opinionated view about the best way to solve these challenges, relieving you of much of this hard work.