Sample Applications

Fast Data Platform provides several sample applications that illustrate how to use the components included in Fast Data Platform to address particular application goals. Hence, a quick way to learn how to use Fast Data Platform for your own applications is to start with one of the sample applications.

The sample applications are available as open source in a GitHub repository, https://github.com/lightbend/fdp-sample-applications/, where you’ll find scripts for building and launching the applications in Fast Data Platform. Pre-built Docker images for the applications are hosted on the Lightbend DockerHub repository, https://hub.docker.com/u/lightbend/.

Disclaimer: The sample applications are provided as-is, without warranty. They are intended to illustrate techniques for implementing various scenarios using Fast Data Platform, but they have not gone through a robust validation process, nor do they use all the techniques commonly employed for highly-resilient, production applications. Please use them with appropriate caution.

The design choices in the applications were driven by three considerations:

  • What tools to pick for a particular problem scenario.
  • How to write applications using those tools.
  • How to deploy and run those applications in Fast Data Platform.

The following sample applications are provided at the time of this writing; more applications are added on a regular basis, outside the regular Fast Data Platform release process. The links go to the corresponding sections in the GitHub repository.

  • VGG Training on CIFAR-10 Data Using BigDL (link): Intel BigDL VGG training ML app, using Spark ML. New BigDL sample applications are under development, so check the repository regularly.
  • Network Intrusion (link): Network intrusion detection with Spark Streaming K-Means.
  • Taxi Times with Flink (link): Predicting taxi travel times with Flink.
  • Web Log Processing with Kafka Streams (link): Kafka Streams sample app for web log processing.
  • Model Serving in Streams (link): One approach to serving data science models in production streaming jobs, using Kafka, Akka Streams, and Kafka Streams.
  • KillrWeather (link): A port to Fast Data Platform of the well-known demonstration app, KillrWeather. It demonstrates a stream pipeline including Kafka, Spark Streaming and Spark Structured Streaming (the newer replacement for Spark Streaming), Cassandra, Akka middleware, and other tools. This is the most complete sample application.

In this section, we provide an overview of each sample application as of this writing. Additional applications are added periodically; check the GitHub repository for the latest information.

Each of the samples has detailed instructions in the corresponding README files. These instructions cover how to build and run the applications locally (i.e., on a laptop), how to build Docker images for them, and how to deploy Lightbend’s Docker images or your own in Fast Data Platform.

Preliminaries

Each sample application is provided in two forms:

  • Pre-built Docker images, hosted on the Lightbend DockerHub repository.
  • The source code in the GitHub repository, which you can build and package yourself.

There are bash scripts in the bin folder of each application that can be used to deploy the Docker images to Fast Data Platform. We recommend starting with the Lightbend DockerHub images, then using the source code to study the details and create modifications.

Prerequisites

Before you use the sample applications, the following prerequisites must be met:

  1. Fast Data Platform is up and running
  2. Services for Kafka, HDFS, and Spark are installed using Fast Data Platform Manager
  3. Optionally, certified services required by some applications are installed, as discussed below
  4. The DC/OS CLI is installed on your workstation
  5. The corresponding DC/OS CLI command plugins for Kafka, HDFS, and Spark are also installed

See Install DC/OS CLI on Your Local Computer for more details on installing the DC/OS CLI. If you run KillrWeather, which requires Cassandra, the Cassandra DC/OS CLI is useful, but not required.

Use of Grafana, InfluxDB, Zeppelin, and Cassandra

Several certified tools are used by the sample applications. The links go to installation instructions under Installation: Next Steps:

  • Grafana (link): Used to visualize data from KillrWeather and the Network Intrusion application.
  • InfluxDB (link): Used to store time-series data in several applications, which is then graphed with Grafana.
  • Zeppelin (link): Used in KillrWeather to query data in its Cassandra tables. Also useful for general data science projects.
  • Cassandra (link): Used in KillrWeather to store computed statistics, etc.

(The specific versions of these components are listed in the Release Notes.)

Some further setup is required for each of these services, such as creation of Grafana dashboards and creation of InfluxDB and Cassandra tables. However, in most cases, a sample app will do these steps automatically. In a few cases, manual steps are required, as discussed below.

Note: Cassandra, InfluxDB, Grafana, and Zeppelin are certified, not supported components in Fast Data Platform (see here for definitions of these terms).

Overview of the Sample Applications

Now we’ll briefly discuss each sample application. When a relative path is shown, e.g., for a script, it is relative to the root directory of your local copy of the GitHub repository.

BigDL and SparkML for VGG Neural Networks

This app demonstrates using Intel’s BigDL library for deep learning with Spark, where a VGG neural network is trained on the CIFAR-10 image dataset. Then the app runs standard validation tests on the trained model.

Intel's BigDL is a library for deep learning model training and serving on Spark.

NOTE: This application demonstrates using BigDL as an example of a third-party, certified library for machine learning. We do not provide support for BigDL.

To deploy the application using the Docker images in the Lightbend DockerHub repository, run the script apps/bigdl/bin/app-install.sh. Detailed instructions on how to build, install, and remove the application can be found in the README.

See also Network Intrusion Detection with Streaming K-Means, which also uses Spark ML.

Network Intrusion Detection with Streaming K-Means

This application uses the Apache Spark MLlib implementation of the K-Means algorithm, which finds K clusters in a data set (for some integer K). While K-Means can be computed as a batch process over a fixed data set, Apache Spark Streaming can also be used to compute K-Means incrementally over streaming data. This app demonstrates how to use this algorithm in the context of looking for network traffic anomalies that might indicate intrusions. The application also functions as a typical Spark example.
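
To make the streaming variant concrete, here is a minimal sketch of Spark MLlib's StreamingKMeans API. This is not the sample application's code: the input directory, feature dimension, and value of K are hypothetical placeholders, and the real app reads its feature vectors from Kafka rather than from files.

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingKMeansSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("streaming-kmeans-sketch"), Seconds(10))

        // Hypothetical source: comma-separated feature vectors dropped into a
        // directory. The real app consumes a Kafka topic instead.
        val features = ssc.textFileStream("/tmp/intrusion-features")
          .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

        val model = new StreamingKMeans()
          .setK(10)                                // hypothetical K; tune via the batch app
          .setDecayFactor(1.0)                     // 1.0 = weight all history equally
          .setRandomCenters(dim = 38, weight = 0.0)

        model.trainOn(features)                    // update cluster centers on each batch
        model.predictOn(features).print()          // emit cluster assignments

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because trainOn and predictOn are separate calls, the same model can be trained on one stream and used to score another.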

Kafka is also used as part of the data pipeline.

The application has the following components that form stages of a pipeline:

  1. Data Ingestion: The first stage reads data from a configurable, watched folder and writes it to a Kafka topic. When you put new files in the folder, the file watcher kick-starts the data ingestion process. The first ingestion, however, is automatic and starts one minute after the application installs.
  2. Data Transformation: The second stage reads the data from the Kafka topic populated in step 1, performs some transformations that will help in later stages of the data manipulation, and writes the transformed output into another Kafka topic. If there are any errors with specific records, these are recorded in a separate error Kafka topic. Stages 1 and 2 are implemented as a Kafka Streams application.
  3. Online Analytics and ML: This stage of the pipeline reads data from the Kafka topic populated by stage 2, sets up a streaming context in Spark, and uses it to do streaming K-means clustering to detect network intrusion. A challenge is to determine the optimal value for K in a streaming context, i.e., by training the model, then testing with a different set of data. (More on this below.)
  4. An implementation of batch K-means: Using this application, the user can iterate on the number of clusters (K) to use for the online anomaly detection. The application accepts a batch duration and, for all data received in that interval, runs batch K-means clustering for every value of K in a user-specified range. The starting and ending values of K and the increment step size are given as command-line arguments; the application runs K-means across the entire range and reports each cluster score (mean squared error). The optimal value of K can then be found using the elbow method, as shown in the sketch after this list.
  5. Visualization with InfluxDB and Grafana: The resulting data is written to InfluxDB and graphed with Grafana. Both must be installed as discussed above.
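
As a rough illustration of stage 4, the following sketch runs MLlib's batch K-means over a range of K values and prints the cost for each (MLlib's computeCost returns the within-set sum of squared errors, the basis of the score mentioned above). It is not the sample application's code; the input path, range bounds, and iteration count are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object ElbowSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("elbow-sketch"))

        // Feature vectors collected during one batch interval; in the real app
        // these come from the transformed-data Kafka topic.
        val data = sc.textFile("/tmp/batch-interval.csv")
          .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
          .cache()

        // Hypothetical range: K from 2 to 20 in steps of 2, 20 iterations each.
        for (k <- 2 to 20 by 2) {
          val model = KMeans.train(data, k, 20)
          println(s"k=$k cost=${model.computeCost(data)}")
        }
        // Plot cost against K and look for the "elbow" where improvement levels off.
      }
    }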

To deploy the application using the Docker images in the Lightbend DockerHub repository, run the script apps/nwintrusion/bin/app-install.sh. Detailed instructions on how to build, install, and remove the application can be found in the README.

Once the application is installed, the visualization of (possibly) anomalous intrusions can be seen by logging in to Grafana with the credentials admin/admin and selecting the dashboard named NetworkIntrusion from the Dashboards menu. The dashboard has a preset alert that can be customized to set the threshold beyond which points are classified as anomalous.

See also the README section Output of Running the App for more details on how to set up the visualization for this application.

The sample apps Kafka Streams Sample App for Web Log Processing and Akka Streams and Kafka Streams Model Server also use Kafka Streams. The BigDL and SparkML sample application also uses Spark ML.

Kafka Streams Sample App for Web Log Processing

This application demonstrates the following features of Kafka Streams:

  1. Building and configuring a streams-based topology using the Kafka Streams DSL as well as the lower-level Processor API
  2. Transformation semantics applied to streaming data
  3. Stateful transformations using local state stores
  4. Interactive queries in Kafka Streams applied to a distributed application
  5. Implementing custom state stores
  6. Interactive queries over custom state stores in a distributed setting

The KStreams app has two separate main applications, both accessible through HTTP interfaces:

  1. An app that uses the Kafka Streams DSL to compute aggregate information with stateful streaming, such as the total number of bytes transferred to a specific host and the total number of accesses made on a specific host. These can also be computed as windowed aggregations (see the sketch after this list).
  2. An app based on the Processor API that implements a custom state store in Kafka Streams to check for membership. It uses a Bloom filter to implement the store on top of the APIs that Kafka Streams provides. It then consumes the ClarkNet dataset and gives the user an HTTP interface to check whether the application has seen a specific host in its pipeline.
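
To give a flavor of the DSL-based aggregation in the first app, here is a minimal sketch written against the Kafka 2.x Scala DSL for Kafka Streams. It is not the sample app's code: the topic names and application id are hypothetical, and it assumes an upstream stage has already parsed the log lines into (host, bytes) records.

    import java.time.Duration
    import java.util.Properties

    import org.apache.kafka.streams.kstream.TimeWindows
    import org.apache.kafka.streams.scala.ImplicitConversions._
    import org.apache.kafka.streams.scala.Serdes._
    import org.apache.kafka.streams.scala.StreamsBuilder
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

    object WebLogBytesPerHost extends App {
      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "weblog-bytes-per-host")  // hypothetical id
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

      val builder = new StreamsBuilder

      builder.stream[String, Long]("weblog-host-bytes")        // hypothetical input topic
        .groupByKey
        .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
        .reduce(_ + _)                                         // total bytes per host per window
        .toStream
        .map((wk, total) => (s"${wk.key}@${wk.window.start}", total))
        .to("weblog-host-bytes-per-minute")                    // hypothetical output topic

      val streams = new KafkaStreams(builder.build(), props)
      streams.start()
      sys.addShutdownHook(streams.close())
    }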

To deploy the application using the Docker images in the Lightbend DockerHub repository, run the script apps/kstream/bin/app-install.sh. Detailed instructions on how to build, install, and remove the application can be found in the README.

The sample apps Network Intrusion Detection with Streaming K-Means and Akka Streams and Kafka Streams Model Server also use Kafka Streams.

Predicting Taxi Travel Times with Flink

This application is adapted to Fast Data Platform from the publicly-available Flink training from dataArtisans. It uses a public dataset of taxi rides in New York City; the details of the dataset can be found here. The application trains a regression model on the taxi data to predict how long a ride will take.
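
As a taste of the Flink programming model rather than the exercise's actual code, the sketch below keeps a per-route running average of observed travel times as keyed state, a toy stand-in for the trained regression model; the TaxiRide type and sample values are hypothetical.

    import org.apache.flink.streaming.api.scala._

    // Hypothetical, simplified stand-in for the exercise's TaxiRide type.
    case class TaxiRide(startCell: Int, endCell: Int, minutes: Double)

    object TaxiTimeSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        val rides = env.fromElements(
          TaxiRide(1, 2, 11.0), TaxiRide(1, 2, 13.0), TaxiRide(3, 4, 25.0))

        // Keyed state holds (sum, count) per route; each ride updates the state
        // and emits the route's current average travel time as the "prediction".
        rides
          .keyBy(r => (r.startCell, r.endCell))
          .mapWithState[(Int, Int, Double), (Double, Long)] { (ride, state) =>
            val (sum, n) = state.getOrElse((0.0, 0L))
            val (s2, n2) = (sum + ride.minutes, n + 1)
            ((ride.startCell, ride.endCell, s2 / n2), Some((s2, n2)))
          }
          .print()

        env.execute("taxi-time-sketch")
      }
    }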

To deploy the application using the Docker images in the Lightbend DockerHub repository, run the script apps/flink/bin/app-install.sh. Detailed instructions on how to build, install, and remove the application can be found in the README.

Akka Streams and Kafka Streams Model Server

This application illustrates one solution to a common design problem: how to serve data science models in a streaming production system.

Detailed instructions on how to build, install, and remove the application can be found in the README.

The application has the following main components, each of which has its own Docker image:

  1. An Akka Streams implementation of the model serving
  2. A Kafka Streams implementation of the model serving
  3. A data loader for driving either of the model-serving components

In addition to the images, data used for testing is packaged in a zip file and made available via HTTP: http://s3-eu-west-1.amazonaws.com/fdp-killrweather-data/data/data.zip.
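
The core idea in both implementations is the same: treat models and data as two streams, keep the most recent model as state, and score each data record with it. Here is a minimal Akka Streams (2.6+) sketch of that pattern; the Msg, ModelUpdate, and Record types are hypothetical stand-ins, and the real app reads both streams from Kafka.

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Flow, Source}

    // Hypothetical stand-ins for the app's model and record abstractions.
    sealed trait Msg
    final case class ModelUpdate(score: Double => Double) extends Msg
    final case class Record(value: Double) extends Msg

    object ModelServingSketch extends App {
      implicit val system: ActorSystem = ActorSystem("model-serving-sketch")

      val models  = Source.single[Msg](ModelUpdate(_ * 2.0))          // stands in for a Kafka models topic
      val records = Source(1 to 5).map(i => Record(i.toDouble): Msg)  // stands in for a Kafka data topic

      // Keep the most recent model and score records with it; records that
      // arrive before any model are dropped in this sketch.
      val serve = Flow[Msg].statefulMapConcat { () =>
        var current: Option[ModelUpdate] = None

        {
          case m: ModelUpdate => current = Some(m); Nil
          case Record(v)      => current.map(_.score(v)).toList
        }
      }

      models.merge(records).via(serve).runForeach(println)
    }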

Fast Data Platform KillrWeather

This is a port to Fast Data Platform of the well-known demonstration app, KillrWeather. It is the largest of the sample applications. It combines Kafka, Spark Streaming, Akka-based middleware for stream processing, Cassandra, and Zeppelin.

Detailed instructions on how to build, install, and remove the application can be found in the README.

Two Fast Data Platform services are required: Kafka and HDFS. (While Spark is also used, the DC/OS Spark service is not required.) Follow the installation instructions in the Fast Data Platform documentation for installing the platform and these two components.

This app also uses the certified services, Cassandra, InfluxDB, and Grafana.

Note: Deploy all the required services before deploying KillrWeather. At startup, KillrWeather will detect their presence and automatically define tables in Cassandra and InfluxDB, and define a dashboard in Grafana. Zeppelin does not need to be installed in advance.

A few manual steps are required if you want to use Zeppelin to view the contents of the Cassandra tables (optional). The README describes these steps.

The architecture of KillrWeather is shown in the following diagram:

[Diagram: KillrWeather architecture]

Historical weather data is fed to a Kafka topic using one of three clients: one that uses the gRPC protocol, one that uses the native Kafka API, and one that uses REST. From that topic, data is read by a Spark Streaming app (both ordinary and structured streaming implementations are available), which computes statistics over the data. The raw data and the results are written to two places. Output to Cassandra is an example of a longer-term storage option suitable for downstream use; to see the data in Cassandra, instructions are provided for using a Zeppelin notebook configured with Cassandra Query Language support. Writing the output to InfluxDB, a time-series database, supports graphing the raw and processed data in real time using Grafana.
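
To illustrate the shape of the structured-streaming leg of this pipeline, here is a minimal sketch that reads raw reports from Kafka and maintains a running per-station average (it requires the spark-sql-kafka connector on the classpath). The topic name and two-field record layout are hypothetical simplifications of the real weather records, and it writes to the console instead of Cassandra and InfluxDB.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object WeatherStatsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("killrweather-sketch").getOrCreate()
        import spark.implicits._

        // Raw reports from Kafka; "weather-raw" is a hypothetical topic name.
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "weather-raw")
          .load()
          .selectExpr("CAST(value AS STRING) AS line")

        // Hypothetical layout "stationId,temperature"; the real app parses
        // full weather reports with many more fields.
        val readings = raw.select(
          split($"line", ",").getItem(0).as("stationId"),
          split($"line", ",").getItem(1).cast("double").as("temperature"))

        // Running average per station, analogous to the app's aggregate statistics.
        readings.groupBy($"stationId")
          .agg(avg($"temperature").as("avgTemp"))
          .writeStream
          .outputMode("complete")
          .format("console")      // the real app writes to Cassandra and InfluxDB
          .start()
          .awaitTermination()
      }
    }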