Sample Applications

Fast Data Platform comes bundled with several sample applications that illustrate how to use the components included in Fast Data Platform to accomplish particular application goals. The sample applications come with scripts for building and launching them in your Fast Data Platform cluster. They are also packaged as pre-built Docker images for convenient execution.

Hence, a quick way to learn how to use Fast Data Platform for your own applications is to start with one of the included sample applications.

For our purposes, there are three considerations:

  • What tools to pick for a particular problem domain.
  • How to write applications using those tools.
  • How to deploy and run those applications in Fast Data Platform clusters.

For this release, we offer the following sample applications, packaged in the fdp-sample-apps archive that’s part of the distribution.

  • Kafka Streams (package fdp-sample-apps, kstream directory)
  • Predicting Taxi Travel Times with Flink (package fdp-sample-apps, flink directory)
  • BigDL VGG Training ML app (package fdp-sample-apps, bigdl directory)
  • Network intrusion detection with Spark Streaming K-Means (package fdp-sample-apps, nwintrusion directory)

In addition, we have a separate distribution of a more complete application, FDP KillrWeather, which is a port to Fast Data Platform of the well-known demonstration app, KillrWeather. This application is found in the fdp-killrweather archive.

Each of the examples has detailed instructions in README files. Here, we’ll only provide an overview.

Preliminaries

The sample applications are provided in two forms: a source distribution (a snapshot of their respective git repos) and a packaged Docker image. We recommend using the image to install the apps in your Fast Data Platform cluster, then using the source distribution for studying the applications and creating modifications.

We’ll use the following directory names as the “root” locations where you expanded the “laboratory” and the source code distributions:

  • fdp-package-sample-apps: where you expanded the fdp-package-sample-apps distribution for running the sample apps “laboratory”. It also has other tools we’ll use below.
  • fdp-sample-apps: where you expanded the fdp-sample-apps source distribution.
  • fdp-killrweather: where you expanded the separate fdp-killrweather source distribution.

Prerequisites

Before you use the sample applications, the following prerequisites must be met:

  1. A Lightbend Fast Data Platform cluster is up and running.
  2. The DC/OS CLI is installed on your workstation. See the section Install DC/OS CLI on Your Local Computer for more details.
  3. The corresponding DC/OS CLI command plugins for Kafka, HDFS, and Spark are also installed. The Cassandra DC/OS CLI is also required for KillrWeather.
  4. Services for Kafka, HDFS, and Spark are installed using Fast Data Platform Manager. Cassandra is also required for KillrWeather; install it from the DC/OS Catalog.

There are additional prerequisites if you want to run the visualizations for the sample applications. Here we use tools that are not supported by Lightbend as part of Fast Data Platform, but that are easy to use in the DC/OS environment.

  1. InfluxDB - InfluxDB is used to save the time-series data for several applications, so it’s easy to graph with Grafana. While you could install InfluxDB from the DC/OS Catalog, the installation needs to be done with the companion distribution that comes with Fast Data Platform, fdp-influxdb-docker-images. (See its README file.) It deploys InfluxDB and creates the databases and retention policies needed by the sample applications.
  2. Grafana - Used for dashboards. Install Grafana from the DC/OS Catalog.
  3. Zeppelin - Used to run queries against Cassandra.

Additional configuration steps for InfluxDB, Grafana, Cassandra, and Zeppelin are required by some of the applications, which will be discussed below.

Installing the Sample Applications Docker Image

The fdp-package-sample-apps-0.8.1.zip distribution that comes with Fast Data Platform contains Linux/MacOSX bash and Windows bat scripts, plus a JSON file, for deploying a pre-built image from Lightbend’s Docker Hub account in your cluster.

Expand this archive distribution and run either run-fdp-apps-lab.sh or run-fdp-apps-lab.bat. Both use the included JSON file with the DC/OS CLI command dcos marathon app add fdp-apps-lab.json.

In the DC/OS web console, you’ll see the fdp-apps-lab service running. It has an embedded web server, which you can see if you open the URL http://fdp-apps-lab.marathon.mesos/. The “home” page is just the nginx web server default home page, but resources needed by the sample applications are staged there and served when running and deploying the apps.

See the fdp-package-sample-apps README for more information.

Source Distribution

The fdp-sample-apps-0.8.1.tar.gz archive (no package in the name) contains the full source code for the sample apps, except for the fdp-killrweather application, which is discussed separately below.

Kafka Streams

This application demonstrates the following features of Kafka Streams:

  1. Building and configuring a Streams-based topology using the Kafka Streams DSL, as well as the lower-level, processor-based APIs
  2. Transformation semantics applied to streaming data
  3. Stateful transformations using local state stores
  4. Interactive queries in Kafka Streams applied to a distributed application
  5. Implementing custom state stores
  6. Interactive queries over custom state stores in a distributed setting

The KStreams app comprises two separate applications, both accessible through HTTP interfaces:

  1. The app based on the DSL APIs uses stateful streaming to compute aggregate information, such as the total number of bytes transferred to a specific host and the total number of accesses made on a specific host. These aggregates can also be computed over time windows. (A minimal sketch of this style of topology follows this list.)
  2. The app based on the Processor APIs implements a custom state store in Kafka Streams to check for membership information. It uses a Bloom filter to implement the store on top of the APIs that Kafka Streams provides. It then consumes the Clarknet dataset and gives the user an HTTP interface to check whether the application has seen a specific host in its pipeline.
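
To give a flavor of the first app's style, here is a minimal sketch of a windowed, stateful aggregation built with the Kafka Streams DSL in Scala. The topic name, application id, and broker address are hypothetical stand-ins; the real values live in the sample app's configuration, and the real topology is considerably richer.

import java.util.Properties
import java.util.concurrent.TimeUnit

import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object HostAccessCounts extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "host-access-counts")  // hypothetical id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
    sys.env.getOrElse("KAFKA_BROKERS", "localhost:9092"))               // hypothetical brokers
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val builder = new StreamsBuilder
  // Records are keyed by host name; count the accesses per host in one-minute windows.
  // The result is backed by a local state store, which is what makes counts like these
  // queryable through the app's HTTP interface (interactive queries).
  builder.stream[String, String]("access-log")  // hypothetical input topic
    .groupByKey()
    .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(1)))
    .count()

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}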

With the fdp-apps-lab “laboratory” running, you can deploy this sample app using the following commands:

$ cd fdp-package-sample-apps/bin/kstream
$ ./app-install.sh --start-only dsl --start-only procedure

To build and run it from source, see the fdp-sample-apps/kstream/README.md file for more details. It also describes what the application does in greater depth.

Note: At this time, the application doesn’t demonstrate the new KSQL interface, which is pre-release.

See also Network Intrusion Detection with Streaming K-Means below, which also uses Kafka Streams.

BigDL + Spark Application

This application is in the bigdl directory. It demonstrates using Intel’s BigDL library for deep learning with Spark, where a VGG neural network is trained on the CIFAR-10 image dataset. Then the app runs standard validation tests on the trained model.

Intel’s BigDL is a library for deep learning model training and serving on Spark.

NOTE: This application demonstrates using BigDL as a third-party library for machine learning. BigDL is not part of the FDP distribution and Lightbend does not provide support for BigDL or applications that use it.
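
For a flavor of how BigDL training is driven from Spark, here is a minimal, hypothetical sketch. It trains a toy network rather than the deep VGG model the sample app actually builds, and it assumes an RDD of labeled Samples has already been prepared from the CIFAR-10 images (that data preparation is elided here).

import com.intel.analytics.bigdl.dataset.Sample
import com.intel.analytics.bigdl.nn.{ClassNLLCriterion, Linear, LogSoftMax, ReLU, Reshape, Sequential}
import com.intel.analytics.bigdl.optim.{Optimizer, SGD, Trigger}
import com.intel.analytics.bigdl.utils.Engine
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object VggTrainingSketch {
  def main(args: Array[String]): Unit = {
    // BigDL requires its own Spark settings and engine initialization.
    val conf = Engine.createSparkConf().setAppName("bigdl-vgg-sketch")
    val sc = new SparkContext(conf)
    Engine.init

    // Hypothetical: the real app prepares this RDD from the CIFAR-10 dataset.
    val trainSamples: RDD[Sample[Float]] = ???

    // A toy stand-in for the much deeper VGG network the sample app trains.
    val model = Sequential[Float]()
      .add(Reshape[Float](Array(3 * 32 * 32)))
      .add(Linear[Float](3 * 32 * 32, 128))
      .add(ReLU[Float]())
      .add(Linear[Float](128, 10))
      .add(LogSoftMax[Float]())

    // Distributed training: BigDL's Optimizer runs the training loop on Spark.
    Optimizer(
      model = model,
      sampleRDD = trainSamples,
      criterion = ClassNLLCriterion[Float](),
      batchSize = 128
    ).setOptimMethod(new SGD[Float](learningRate = 0.01))
      .setEndWhen(Trigger.maxEpoch(10))
      .optimize()
  }
}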

With the fdp-apps-lab “laboratory” running, you can deploy this sample app using the following commands:

$ cd fdp-package-sample-apps/bin/bigdl
$ ./app-install.sh

To build and run it from source, see the fdp-sample-apps/bigdl/README.md file for more details. It also describes what the application does in greater depth.

Network Intrusion Detection with Streaming K-Means

The largest sample app in the fdp-sample-apps package uses Apache Spark’s MLlib implementation of the K-Means algorithm, which finds K clusters in a data set (for some integer K). While K-Means can be computed as a batch process over a fixed data set, Apache Spark Streaming can also be used to compute K-Means incrementally over streaming data. This app demonstrates how to use the algorithm in the context of looking for network traffic anomalies that might indicate intrusions. The application also functions as a general example of a typical Spark application and how to run it in Fast Data Platform clusters.

Kafka is also used as part of the data pipeline.

The application has the following components that form stages of a pipeline:

  1. Data Ingestion: The first stage reads data from a configurable, watched folder and writes it to a Kafka topic. When you put new files in the folder, the file watcher kicks off the ingestion process. The first ingestion, however, is automatic and starts one minute after the application installs.
  2. Data Transformation: The second stage reads the data from the Kafka topic populated in step 1, performs some transformations that will help in later stages of the data manipulation, and writes the transformed output into another Kafka topic. If there are any errors with specific records, these are recorded in a separate error Kafka topic. Stages 1 and 2 are implemented as a Kafka Streams application.
  3. Online Analytics and ML: This stage of the pipeline reads data from the Kafka topic populated by stage 2, sets up a streaming context in Spark, and uses it to do streaming K-means clustering to detect network intrusions. A challenge is to determine the optimal value for K in a streaming context, i.e., by training the model, then testing with a different set of data. (More on this below; a minimal sketch of this stage follows the list.)
  4. An implementation of batch K-means: This application lets the user iterate on the number of clusters (K) to use for the online anomaly detection. The user specifies a batch duration, along with the starting value, ending value, and increment step size for K, as command-line arguments. For all data received within that duration, the application runs batch K-means for each value of K in the range and reports the cluster score (mean squared error). The optimal value of K can then be found using the elbow method.
  5. Visualization with InfluxDB and Grafana: The result data is written to InfluxDB and graphed with Grafana. Both are installed as third-party components from the DC/OS Catalog.
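
As a rough illustration of the heart of stage 3, here is a minimal Spark Streaming K-means sketch. The input path, feature dimension, value of K, and anomaly threshold are all hypothetical; the real app consumes its feature vectors from the stage-2 Kafka topic and picks K offline using the batch elbow analysis from stage 4.

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IntrusionDetectionSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("nwintrusion-sketch"), Seconds(60))

    // Stand-in source: comma-separated feature vectors dropped into a directory.
    // The real app reads the transformed records from the stage-2 Kafka topic.
    val points = ssc.textFileStream("/tmp/nwintrusion/features")  // hypothetical path
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    val kmeans = new StreamingKMeans()
      .setK(10)                    // chosen offline via the elbow method (stage 4)
      .setDecayFactor(1.0)
      .setRandomCenters(38, 0.0)   // the feature dimension (38) is hypothetical

    // Update the cluster centers incrementally as each batch arrives.
    kmeans.trainOn(points)

    // Points far from their nearest cluster center are anomaly candidates.
    points.foreachRDD { rdd =>
      val model = kmeans.latestModel()
      rdd.map(v => (v, Vectors.sqdist(v, model.clusterCenters(model.predict(v)))))
        .filter { case (_, dist) => dist > 100.0 }  // hypothetical threshold
        .foreach { case (v, dist) => println(s"Possible intrusion (distance $dist): $v") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}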

With the fdp-apps-lab “laboratory” running, you can deploy this sample app using the following commands:

$ cd fdp-package-sample-apps/bin/nwintrusion
$ ./app-install.sh

Visualization of the results is done with a Grafana dashboard that needs to be set up after installing Grafana, as discussed above.

For more details on how to set up the visualization for this application, see fdp-sample-apps/nwintrusion/README.md, in the section Output of Running the App. The README also provides information on building and running from source.

Predicting Taxi Travel Times with Flink

WARNING: This app is currently not compatible with DC/OS 1.10. This will be fixed in the next release. Also, Flink integration is offered as a “preview” in Fast Data Platform. Use with caution. For these reasons, this app is not provided through fdp-apps-lab.

This application is in the fdp-sample-apps/flink directory. It is adapted to Fast Data Platform from the publicly-available Flink training from dataArtisans. It uses a public dataset of taxi rides in New York City; the details of the dataset are described in the training materials. The application trains a regression model on the taxi data to predict how long a ride will take.
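
The following is only a schematic sketch of a Flink streaming job over simplified ride records, not the actual training pipeline. The real app uses the dataArtisans TaxiRide sources and a trained regression model; this stand-in just "predicts" with a running mean per start/end cell pair, and the record type is hypothetical.

import org.apache.flink.streaming.api.scala._

// Hypothetical, simplified ride record.
case class RideSummary(startCell: Int, endCell: Int, minutes: Double)

object TaxiTimesSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in data; the real app streams the NYC taxi dataset.
    val rides = env.fromElements(
      RideSummary(1, 7, 12.5), RideSummary(1, 7, 14.0), RideSummary(2, 3, 6.2))

    // Estimate the travel time for a (start, end) cell pair as the running
    // mean of the rides observed so far for that pair.
    rides
      .map(r => (r.startCell, r.endCell, r.minutes, 1L))
      .keyBy(t => (t._1, t._2))
      .reduce((a, b) => (a._1, a._2, a._3 + b._3, a._4 + b._4))
      .map(t => s"cells ${t._1} -> ${t._2}: ${t._3 / t._4} minutes on average")
      .print()

    env.execute("taxi-times-sketch")
  }
}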

See the fdp-sample-apps/flink/README.md file for more information.

KillrWeather

The fdp-killrweather-0.8.1 archive contains the most comprehensive example app, a port to Fast Data Platform of the well-known demonstration app, KillrWeather. It demonstrates stream processing using Kafka, Spark Streaming, Cassandra, and Akka-based middleware.

The architecture of KillrWeather is shown in the following diagram:

[Diagram: KillrWeather architecture]

Historical weather data is fed to a Kafka topic using one of three clients: one that uses the new gRPC protocol, one that uses the native Kafka API, and one that uses REST. From that topic, data is read by a Spark Streaming app, coordinated by an Akka-based service, where statistics over the data are computed. The raw data and the results are written to two places. Output to Cassandra is an example of a longer-term storage option suitable for downstream use. To see the data in Cassandra, instructions are provided for using a Zeppelin notebook configured with Cassandra Query Language support. Writing the output to InfluxDB, a time-series database, supports graphing the raw and processed data in real time using Grafana.
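
As a sketch of the middle of this pipeline, here is a hypothetical Spark Streaming snippet that computes a per-station mean temperature and writes it to Cassandra with the spark-cassandra-connector. The socket source stands in for the real Kafka input, and the record types, keyspace, and table name are simplifications of the real schema.

import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical, simplified records; the real schema is much richer.
case class Observation(stationId: String, temperature: Double)
case class TempAggregate(stationId: String, count: Long, meanTemp: Double)

object WeatherStatsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("killrweather-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // set to your Cassandra host
    val ssc = new StreamingContext(conf, Seconds(10))

    // Stand-in for the Kafka topic fed by the loader and the gRPC/HTTP clients.
    val observations = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(id, temp) = line.split(',')
      Observation(id, temp.toDouble)
    }

    // Per-station statistics, written to Cassandra for downstream use.
    observations
      .map(o => (o.stationId, (1L, o.temperature)))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
      .map { case (id, (n, sum)) => TempAggregate(id, n, sum / n) }
      .saveToCassandra("weather_ks", "temperature_aggregates")  // hypothetical keyspace/table

    ssc.start()
    ssc.awaitTermination()
  }
}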

With the fdp-apps-lab “laboratory” running, you can deploy this sample app using the following commands:

$ cd fdp-package-sample-apps/bin/killrweather
$ ./app-install.sh

It deploys four services:

  • killrweatherapp - The central service running the Spark Streaming job.
  • killrweatherloader - A service that reads weather report data (staged in the nginx server) and publishes it to Kafka using the Kafka producer API.
  • killrweathergrpcclient - A service that accepts input data over the gRPC protocol and pushes it to Kafka.
  • killrweatherhttpclient - A service that accepts input data over the HTTP protocol and pushes it to Kafka.

There are some additional setup steps required for InfluxDB, Cassandra, and optionally, Zeppelin. Some of these steps are performed automatically. See the fdp-killrweather/README.md file for details, as well as for information on building the app yourself and running it locally.