Sample Applications

Fast Data Platform comes bundled with several sample applications that illustrate how to use the components included in Fast Data Platform to address particular application goals. The sample applications are packaged in a pre-built Docker image for convenient execution. They are also provided as source code distributions with scripts for building and launching the applications in your Fast Data Platform cluster.

Hence, a quick way to learn how to use Fast Data Platform for your own applications is to start with one of the sample applications.

For our purposes, there are three considerations:

  • What tools to pick for a particular problem domain.
  • How to write applications using those tools.
  • How to deploy and run those applications in Fast Data Platform clusters.

The following sample applications are provided.

The first set is packaged in the fdp-sample-apps-1.0.0.zip archive that’s part of the distribution.

  • BigDL VGG Training ML app, using SparkML (bigdl directory)
  • Network intrusion detection with Spark Streaming K-Means (nwintrusion directory)
  • Predicting Taxi Travel Times with Flink (flink directory)
  • Kafka Streams Sample App for Web Log Processing (kstream directory)

Another large sample application is packaged in fdp-akka-kafka-streams-model-server-1.0.0.zip. It demonstrates solutions to a common production problem: how to serve data science models in production.

Finally, the most complete sample application is fdp-killrweather-1.0.0.zip, a port to Fast Data Platform of the well-known demonstration app, KillrWeather. It demonstrates a stream pipeline including Kafka, Spark Streaming and Spark Structured Streaming (the newer replacement for Spark Streaming), Cassandra, Akka middleware, and other tools.

Each of the examples has detailed instructions in the corresponding README files. Here, we’ll only provide an overview.

Preliminaries

The sample applications are provided in two forms: as source distributions, which are snapshots of their respective git repos (which we intend to open source completely), and as prebuilt binaries in a Docker image. We recommend using the image to install the apps in your Fast Data Platform cluster, then using the source distributions for studying the applications and creating modifications.

We call this Docker image a laboratory, a term we use below and in the project READMEs. When installed, its service name is fdp-apps-lab.

Prerequisites

Before you use the sample applications, the following prerequisites must be met:

  1. A Lightbend Fast Data Platform cluster is up and running.
  2. Services for Kafka, HDFS, and Spark are installed using Fast Data Platform Manager.
  3. The Cassandra service is required for KillrWeather; install it from the DC/OS Catalog.
  4. The DC/OS CLI is installed on your workstation. See the section Install DC/OS CLI on Your Local Computer for more details.
  5. The corresponding DC/OS CLI command plugins for Kafka, HDFS, and Spark are also installed. The Cassandra DC/OS CLI plugin is useful when you use Cassandra, but it’s not required for KillrWeather.

There are additional prerequisites if you want to run visualizations of the sample applications, especially KillrWeather.

  1. Grafana - Used for dashboards.
  2. InfluxDB - InfluxDB is used to save the time series data for several applications, so it’s easy to graph with Grafana.
  3. Zeppelin - Used to run queries against Cassandra, but also useful for interactive Spark sessions to look at data in HDFS, etc.

See installation instructions below for these components.

Note: Cassandra, InfluxDB, Grafana, and Zeppelin are not supported components in Fast Data Platform. However, their use here demonstrates that the DC/OS ecosystem and Fast Data Platform embrace the additional, third-party components that you need.

Installing the Sample Applications Docker Image

The fdp-package-sample-apps-1.0.0.zip archive contains bash (Linux/macOS) and bat (Windows) scripts, along with a JSON file, for deploying the pre-built image from Lightbend’s Docker Hub account in your cluster.

Expand this archive and run either run-fdp-apps-lab.sh or run-fdp-apps-lab.bat. Both pass the included JSON file to the DC/OS CLI command dcos marathon app add fdp-apps-lab.json.

In the DC/OS web console, you’ll see the fdp-apps-lab service running. It has an embedded web server, which you can see if you open the URL http://fdp-apps-lab.marathon.mesos/. The “home” page is just the nginx web server’s default home page, but resources needed by the sample applications are staged there and served when the apps are deployed and run.

See the fdp-package-sample-apps README for additional information.

Sample Apps Details

Now we’ll discuss each sample app: the commands included in fdp-package-sample-apps-1.0.0 for deploying the prebuilt applications, and the corresponding sources in other archives, such as fdp-sample-apps-1.0.0.zip (no package in the name).

The fdp-sample-apps-1.0.0.zip archive contains the full source code for most of the sample apps. The sources for fdp-killrweather and fdp-akka-kafka-streams-model-server are packaged separately, as discussed below.

Grafana, InfluxDB, and Zeppelin

  • Install Grafana if you want to visualize data from KillrWeather or the Network Intrusion app.
  • Install InfluxDB if you want to visualize data from KillrWeather.
  • Install Zeppelin if you want to query data in the Cassandra tables for KillrWeather.

To install Grafana, use the following commands in fdp-package-sample-apps-1.0.0:

$ cd fdp-package-sample-apps/bin/grafana
$ ./app-install.sh

To install InfluxDB, use the following commands in fdp-package-sample-apps-1.0.0:

$ cd fdp-package-sample-apps/bin/influxdb
$ ./app-install.sh

(Neither of these app-install.sh scripts offers a --help option, because they have no options!)

To install Zeppelin, use the installation instructions here.

Some further setup is required for each of these services. In some cases, a sample app will perform the steps it requires automatically, such as creating InfluxDB and Cassandra tables. In other cases, manual steps are required, as discussed below on a case-by-case basis.

BigDL + Spark Application for VGG Neural Networks

This app demonstrates using Intel’s BigDL library for deep learning with Spark, where a VGG neural network is trained on the CIFAR-10 image dataset. Then the app runs standard validation tests on the trained model.

Intel’s BigDL is a library for deep learning model training and serving on Spark.

NOTE: This application demonstrates using BigDL as a third-party library for machine learning. BigDL is not part of the FDP distribution and Lightbend does not provide support for BigDL.

To deploy the prebuilt application using fdp-apps-lab, use the following commands in fdp-package-sample-apps-1.0.0:

$ cd fdp-package-sample-apps/bin/bigdl
$ ./app-install.sh --help   # see the available options
$ ./app-install.sh

The source for this application is in the fdp-sample-apps-1.0.0/bigdl directory. To build and run it from source, see the fdp-sample-apps/bigdl/README.md file for details. It also describes what the application does in greater depth.

See also Network Intrusion Detection with Streaming K-Means, which also uses Spark ML.

Network Intrusion Detection with Streaming K-Means

The largest sample app in the fdp-sample-apps package uses Apache Spark’s MLlib implementation of the K-Means algorithm, which finds K clusters in a data set (for some integer K). While K-Means can be computed as a batch process over a fixed data set, Apache Spark Streaming can also be used to compute K-Means incrementally over streaming data. This app demonstrates how to use this algorithm in the context of looking for network traffic anomalies that might indicate intrusions. The application also serves as a general example of a typical Spark application and how to run it in Fast Data Platform clusters.

Kafka is also used as part of the data pipeline.

The application has the following components that form stages of a pipeline:

  1. Data Ingestion: The first stage reads data from a configurable, watched folder and writes it to a Kafka topic. You can put new files in the folder, and the file watcher will kick-start the data ingestion process. The first ingestion, however, is automatic and starts one minute after the application installs.
  2. Data Transformation: The second stage reads the data from the Kafka topic populated in stage 1, performs some transformations that will help in later stages of the data manipulation, and writes the transformed output into another Kafka topic. If there are any errors with specific records, these are recorded in a separate error Kafka topic. Stages 1 and 2 are implemented as a Kafka Streams application.
  3. Online Analytics and ML: This stage of the pipeline reads data from the Kafka topic populated by stage 2, sets up a streaming context in Spark, and uses it to do streaming K-Means clustering to detect network intrusion (see the first sketch after this list). A challenge is to determine the optimal value for K in a streaming context, i.e., by training the model, then testing with a different set of data. (More on this below.)
  4. An implementation of batch K-Means: Using this application, the user can iterate on the number of clusters (K) that should be used for the online anomaly detection part. The application accepts a batch duration, and for all data it receives in that interval of time, it runs batch K-Means clustering for every value of K in a user-specified range. The user specifies the starting and ending values of K and the increment step size as command-line arguments, and the application runs K-Means for the entire range, reporting the cluster score (mean squared error) for each K. The optimal value of K can then be found using the elbow method (see the second sketch after this list).
  5. Visualization with InfluxDB and Grafana: The resulting data is written to InfluxDB and graphed with Grafana. Both are installed as third-party components from the DC/OS Catalog.
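
For orientation, here is a minimal sketch of what stage 3’s streaming clustering looks like using Spark MLlib’s StreamingKMeans. This is not the app’s actual source: the file-based input, feature dimension, and parameter values are illustrative assumptions (the real app consumes its features from the Kafka topic written by stage 2).

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-kmeans-sketch")
    val ssc  = new StreamingContext(conf, Seconds(60))

    // Hypothetical directory of CSV feature records; the real app consumes Kafka.
    val points = ssc.textFileStream("/tmp/nwintrusion-features")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    // K = 8 and 38 features are illustrative; use the stage 4 tool to pick K.
    val model = new StreamingKMeans()
      .setK(8)
      .setDecayFactor(1.0)
      .setRandomCenters(38, 0.0)

    model.trainOn(points)            // update cluster centers incrementally
    model.predictOn(points).print()  // assign each point to its nearest cluster

    ssc.start()
    ssc.awaitTermination()
  }
}

Stage 4’s elbow-method search can be as simple as scoring batch K-Means over a range of K values and watching where the error improvement levels off. Again a sketch, with illustrative names and iteration count:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object ElbowSketch {
  // Print the cluster score (WSSSE) for each candidate K; the "elbow" in
  // these scores suggests the K to use for the streaming job.
  def scoreRange(data: RDD[Vector], kStart: Int, kEnd: Int, step: Int): Unit =
    for (k <- kStart to kEnd by step) {
      val model = KMeans.train(data, k, 20)  // 20 iterations, illustrative
      println(s"K = $k, cluster score = ${model.computeCost(data)}")
    }
}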

To deploy the prebuilt application using fdp-apps-lab, use the following commands:

$ cd fdp-package-sample-apps/bin/nwintrusion
$ ./app-install.sh --help   # see the available options
$ ./app-install.sh

Visualization of the results is done with a Grafana dashboard that needs to be set up. Install Grafana using the instructions above, then follow these steps:

  • Open the Grafana UI: http://leader.mesos/#/services/overview/%2Fgrafana/ (see note below)
  • Login with admin/admin
  • Click the Home button in the upper left-hand side
  • Click Import Dashboard
  • Import the JSON file fdp-package-sample-apps/bin/nwintrusion/nwintrusion-grafana-dashboard.json

Note: If the Grafana URL fails to work, open the DC/OS Services, click Grafana, click the one task running, and finally click the ENDPOINT URL shown. In a DC/OS EE cluster, you’ll need to change https to http in the last URL.

For more details on how to set up the visualization of this application, see the source code README, fdp-sample-apps/nwintrusion/README.md. Find the section Output of Running the App. The README also provides information on building and running from source.

See also Kafka Streams Sample App for Web Log Processing and Akka Streams and Kafka Streams Model Server, which also use Kafka Streams. BigDL + Spark Application for VGG Neural Networks also uses Spark ML.

Kafka Streams Sample App for Web Log Processing

This application demonstrates the following features of Kafka Streams:

  1. Building and configuring a Streams-based topology using the Kafka Streams DSL as well as the lower-level Processor APIs
  2. Transformation semantics applied to streams data
  3. Stateful transformations using local state stores
  4. Interactive queries in Kafka Streams applied to a distributed application
  5. Implementing custom state stores
  6. Interactive queries over custom state stores in a distributed setting

The KStreams app has two separate main applications, both accessible through HTTP interfaces:

  1. The app based on the DSL APIs computes aggregate statistics over the stream, such as the total number of bytes transferred to a specific host and the total number of accesses made on a specific host. These can be computed over time windows as well. (See the first sketch after this list.)
  2. The app based on the Processor APIs implements a custom state store in Kafka Streams to check for membership information. It uses a Bloom filter to implement the store on top of the APIs that Kafka Streams provides. It then consumes the Clarknet dataset and gives the user an HTTP interface to check whether the application has seen a specific host in its pipeline. (See the second sketch after this list.)
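
To make item 1 concrete, here is a minimal sketch of a DSL-based aggregation using the Kafka Streams Scala DSL (which grew out of Lightbend’s kafka-streams-scala library). It is not the app’s source; the topic names and the "host,bytes" record format are illustrative assumptions:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

object WebLogBytesPerHostSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "weblog-bytes-sketch")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

    val builder = new StreamsBuilder()

    // Each record value is assumed to be "host,bytes"; the real app parses full web logs.
    builder.stream[String, String]("weblog-records")
      .map { (_, line) =>
        val fields = line.split(',')
        (fields(0), fields(1).toLong)
      }
      .groupByKey                    // group records by host
      .reduce(_ + _)                 // running total of bytes per host
      .toStream
      .to("weblog-bytes-per-host")   // changelog of totals, consumable downstream

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}

Windowed variants of these aggregations follow the same shape, inserting a windowing step between the grouping and the aggregation.

Item 2’s custom store is built around a Bloom filter, a space-efficient, probabilistic membership structure: false positives are possible, but false negatives are not. Here is a minimal standalone Scala sketch of the idea, separate from Kafka Streams’ state store plumbing:

import scala.util.hashing.MurmurHash3

class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  // Derive several bit positions per item by varying the hash seed.
  private def indices(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      (MurmurHash3.stringHash(item, seed) & Int.MaxValue) % numBits  // mask to non-negative
    }

  def add(item: String): Unit = indices(item).foreach(bits.set)
  def mightContain(item: String): Boolean = indices(item).forall(bits.get)
}

The real app wires such a filter into Kafka Streams’ state store machinery, which is what makes it fault tolerant and queryable over HTTP.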

Deploy the prebuilt application using fdp-apps-lab with the following commands:

$ cd fdp-package-sample-apps/bin/kstream
$ ./app-install.sh --help   # see the available options
$ ./app-install.sh ...

In the source distribution, fdp-sample-apps, see kstream/README.md for information about the options you can pass to app-install.sh. The README also has information about building and running from source, and it describes what the application does in greater depth.

Note: At this time, the application doesn’t demonstrate the new KSQL interface, which was pre-release at the time of this writing.

See also Network Intrusion Detection with Streaming K-Means and Akka Streams and Kafka Streams Model Server, which also use Kafka Streams.

Predicting Taxi Travel Times with Flink

This application is adapted to Fast Data Platform from the publicly-available Flink training from dataArtisans. It uses a public dataset of taxi rides in New York City. The details of the dataset can be found here. The application trains a regression model on the taxi data to predict how long a ride will take.
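
For a rough feel of the shape of such a Flink job, here is a minimal Scala sketch. It is not the training code: the file path, CSV layout, and the trivial distance-based “model” are illustrative stand-ins for the real regression logic:

import org.apache.flink.streaming.api.scala._

object TaxiTravelTimeSketch {
  // Simplified ride event; the real dataset has many more fields.
  case class Ride(rideId: Long, startLon: Double, startLat: Double,
                  endLon: Double, endLat: Double)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val rides = env.readTextFile("/tmp/nycTaxiRides.csv")  // hypothetical path
      .map { line =>
        val f = line.split(',')
        Ride(f(0).toLong, f(1).toDouble, f(2).toDouble, f(3).toDouble, f(4).toDouble)
      }

    // Stand-in "model": predict travel time from straight-line distance.
    // The real app learns a regression model from observed ride times instead.
    val predictions = rides.map { r =>
      val dist = math.hypot(r.endLon - r.startLon, r.endLat - r.startLat)
      (r.rideId, dist * 250.0)  // illustrative minutes-per-unit-distance coefficient
    }

    predictions.print()
    env.execute("taxi-travel-time-sketch")
  }
}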

Deploy the prebuilt application using fdp-apps-lab with the following commands:

$ cd fdp-package-sample-apps/bin/flink
$ ./app-install.sh --help   # see the available options
$ ./app-install.sh ...

The source is in the fdp-sample-apps/flink directory. See the fdp-sample-apps/flink/README.md file for more information.

Akka Streams and Kafka Streams Model Server

This application illustrates one solution to a common design problem: how to serve data science models in a streaming production system.
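
The heart of the design is a stream of data records scored against a model that can be replaced on the fly, with both arriving via Kafka. Here is a minimal Akka Streams sketch using the Akka Streams Kafka (“Reactive Kafka”) connector; the topic name, Model trait, and scoring logic are illustrative assumptions, not the app’s API:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.ByteArrayDeserializer

object ModelServingSketch {
  // Minimal model abstraction; the real app also consumes a second topic
  // of serialized models and swaps in the current one as new models arrive.
  trait Model { def score(record: Array[Byte]): Double }
  @volatile var currentModel: Option[Model] = None

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("model-serving-sketch")
    implicit val materializer: ActorMaterializer = ActorMaterializer()

    val settings =
      ConsumerSettings(system, new ByteArrayDeserializer, new ByteArrayDeserializer)
        .withBootstrapServers("localhost:9092")
        .withGroupId("model-serving-sketch")
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    // Score each data record with whatever model is currently loaded.
    Consumer.plainSource(settings, Subscriptions.topics("data"))
      .map(record => currentModel.map(_.score(record.value())))
      .runWith(Sink.foreach(println))
  }
}

See the archive’s README for the actual implementations, including the Kafka Streams variant that keeps the current model in a state store.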

Deploy the prebuilt application using fdp-apps-lab with the following commands:

$ cd fdp-package-sample-apps/bin/modelserving
$ ./app-install.sh --help   # see the available options
$ ./app-install.sh ...

The fdp-akka-kafka-streams-model-server-1.0.0.zip archive has the source code for this large sample application. See its README for extensive instructions and details.

KillrWeather

The fdp-killrweather-1.0.0.zip archive contains the most comprehensive example app, a port to Fast Data Platform of the well-known demonstration app, KillrWeather, that combines Kafka, Spark Streaming, Cassandra, and Akka-based middleware for stream processing.

While the source code is packaged separately, the prebuilt binaries are included in the sample apps laboratory.

The architecture of KillrWeather is shown in the following diagram:

[Diagram: KillrWeather architecture]

Historical weather data is fed to a Kafka topic using one of three clients: one that uses the new gRPC protocol, one that uses the native Kafka API, and one that uses REST. From that topic, data is read by a Spark Streaming app, coordinated by an Akka-based service, where statistics over the data are computed. The raw data and the results are written to two places. Output to Cassandra is an example of a longer-term storage option suitable for downstream use. To see the data in Cassandra, instructions are provided for using a Zeppelin notebook configured with Cassandra Query Language support. Writing the output to InfluxDB, a time-series database, supports graphing the raw and processed data in real time using Grafana.
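
For a feel of the core data flow, here is a minimal Spark Streaming sketch that reads reports from Kafka and writes them to Cassandra with the Spark-Cassandra connector. The topic name, CSV layout, and simplified schema are illustrative assumptions (the keyspace name matches the one queried below):

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WeatherPipelineSketch {
  case class RawWeather(wsid: String, year: Int, month: Int, day: Int, temperature: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("killrweather-pipeline-sketch")
      .set("spark.cassandra.connection.host", "node.cassandra.l4lb.thisdcos.directory")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker.kafka.l4lb.thisdcos.directory:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "weather-sketch")

    // Hypothetical topic of CSV weather reports: wsid,year,month,day,temperature
    val reports = KafkaUtils.createDirectStream[String, String](
        ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq("weather-reports"), kafkaParams))
      .map(_.value.split(','))
      .map(f => RawWeather(f(0), f(1).toInt, f(2).toInt, f(3).toInt, f(4).toDouble))

    // Raw records go to Cassandra for long-term storage (simplified table).
    reports.saveToCassandra("isd_weather_data", "raw_weather_data",
      SomeColumns("wsid", "year", "month", "day", "temperature"))

    ssc.start()
    ssc.awaitTermination()
  }
}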

Note: Make sure you have deployed Cassandra, InfluxDB, and Grafana before deploying KillrWeather (see instructions above). At startup, KillrWeather will detect their presence and automatically define tables in Cassandra and InfluxDB, and define a dashboard in Grafana. Zeppelin does not need to be installed in advance.

Deploy the prebuilt application using fdp-apps-lab with the following commands:

$ cd fdp-package-sample-apps/bin/killrweather
$ ./app-install.sh --help   # see the available options
$ ./app-install.sh

By default, it deploys four services:

  • killrweatherapp - The central service running the Spark Streaming job. (There is an alternative killrweatherappstructured, which uses Spark Structured Streaming, that you can run using a command-line flag.)
  • killrweatherloader - A service that reads weather report data (staged in the nginx server) and publishes it to Kafka using the Kafka producer API.
  • killrweathergrpcclient - A service that accepts input data over the gRPC protocol and pushes it to Kafka.
  • killrweatherhttpclient - A service that accepts input data over the HTTP protocol and pushes it to Kafka.

The README in the source code archive, fdp-killrweather-1.0.0.zip, describes in more detail the setup steps performed automatically for InfluxDB, Cassandra, and Grafana. Manual steps are required for Zeppelin, if you want to use it to view the contents of Cassandra tables. Here we reproduce the steps described in the README.

First, follow these instructions to configure Cassandra in Zeppelin.

The most important configuration settings are the following:

Name                   Value
cassandra.cluster      cassandra
cassandra.hosts        node.cassandra.l4lb.thisdcos.directory
cassandra.native.port  9042

Create a notebook, then try adding the following, one “pair” of %cassandra and CQL query per notebook cell:

%cassandra
use isd_weather_data;

%cassandra
select day, wsid, high, low, mean, stdev, variance from daily_aggregate_temperature WHERE year=2008 and month=6 allow filtering;

%cassandra
select wsid, day, high, low, mean, stdev, variance from daily_aggregate_pressure WHERE year=2008 and month=6 allow filtering;

%cassandra
select wsid, month, high, low, mean, stdev, variance from monthly_aggregate_temperature WHERE year=2008 allow filtering;

%cassandra
select wsid, month, high, low, mean, stdev, variance from monthly_aggregate_pressure WHERE year=2008 allow filtering;

Zeppelin will graph the results for you.