Cloudflow streamlets

Each streamlet has a unique canonical class name and metadata properties such as a description and a set of labels. A streamlet can have inlets and outlets. A streamlet should have at least one inlet or outlet. Each inlet is identified by the following:

  • a codec that indicates the type of data it accepts; Cloudflow currently supports Avro as the codec

  • the class of data it accepts, used as the type parameter when defining the inlet

  • a name specified as part of the inlet definition

Outlets have the same attributes as inlets. In addition, every outlet must specify a partitioning function as part of its definition. This function determines how data is partitioned when it is written to Kafka downstream; partitioning the data in Kafka is what enables scalability.
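
For illustration, an inlet and an outlet could be declared as in the following sketch. The classes SensorData and Metric and the field deviceId are hypothetical stand-ins for Avro-generated classes from your own schemas:

    import cloudflow.streamlets._
    import cloudflow.streamlets.avro._

    // Inlet: Avro codec, the accepted data class as the type parameter, and a name.
    val in = AvroInlet[SensorData]("in")

    // Outlet: the same attributes plus a partitioning function; here all records
    // with the same (hypothetical) deviceId land in the same Kafka partition.
    val out = AvroOutlet[Metric]("out", metric => metric.deviceId.toString)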

Each streamlet in the application blueprint is added by giving it a unique "reference" name, which identifies it. Multiple instances of the same streamlet can be added to the blueprint as long as they have different reference names.
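
For example, the streamlets section of a blueprint could declare two instances of the same streamlet class under different reference names (the class names below are hypothetical):

    blueprint {
      streamlets {
        // hypothetical streamlet classes
        logger1    = com.example.RecordLogger
        logger2    = com.example.RecordLogger
        validation = com.example.RecordValidation
      }
      ...
    }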

Two streamlets can be connected together by connecting the inlet of one streamlet to a compatible outlet of another streamlet. Compatibility of inlets with outlets is purely based on the compatibility of their associated schemas. An inlet with a schema A can only be connected to an outlet that has a schema that is compatible with schema A. This compatibility is verified by the Cloudflow sbt plugin.

Each outlet can have multiple downstream inlets connected to it. Every inlet reads data from the outlet independently. An example of a blueprint with multiple inlets connected to the same outlet would look like this:

    ...
    validation.out-1 = [
      logger2.in,
      other-streamlet.in
    ]
    ...

For more details on how to develop, compose, configure, deploy, and operate streamlets in an application, see Developing Cloudflow streamlets.

The streamlet logic defines the business process that the streamlet executes. For example, in a streamlet that validates incoming streaming data and classifies it into valid and invalid records, the exact validation rules are the business process and must be defined as part of the streamlet logic.

Depending on its shape, a streamlet conceptually maps to a particular element of a streaming data pipeline. In the following sections we discuss a few of the most commonly used shapes.

Streamlet shapes

The inlets and outlets define the shape of a streamlet. For example, a streamlet that accepts streaming data of a specific type and splits it into valid and invalid records can have one inlet and two outlets (one for valid and one for invalid records). Cloudflow offers APIs to build streamlets of various shapes.
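
As a sketch, the shape of such a validating streamlet could be declared as follows, with SensorData and InvalidRecord standing in for hypothetical Avro-generated classes (the validation rules themselves belong in the streamlet logic):

    import cloudflow.streamlets._
    import cloudflow.streamlets.avro._

    // SensorData and InvalidRecord are hypothetical Avro-generated classes.
    val in      = AvroInlet[SensorData]("in")
    val valid   = AvroOutlet[SensorData]("valid", _.deviceId.toString)
    val invalid = AvroOutlet[InvalidRecord]("invalid", _.deviceId.toString)

    // One inlet and two outlets: valid records on one outlet, invalid on the other.
    val shape = StreamletShape(in).withOutlets(valid, invalid)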

Ingress

An Ingress is a streamlet with zero inlets and one or more outlets. An ingress could be a server handling requests, like the HttpIngress that comes standard with Cloudflow. It could also be polling some back-end for data or it could even be a simple test data generator.

Figure 1. Ingress
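
A minimal test data generator ingress might look roughly like the following Akka-based sketch; SensorData and its constructor are hypothetical placeholders for an Avro-generated class:

    import scala.concurrent.duration._
    import akka.stream.scaladsl.Source
    import cloudflow.akkastream._
    import cloudflow.akkastream.scaladsl._
    import cloudflow.streamlets._
    import cloudflow.streamlets.avro._

    object SensorDataGenerator extends AkkaStreamlet {
      val out   = AvroOutlet[SensorData]("out", _.deviceId.toString) // hypothetical class
      val shape = StreamletShape.withOutlets(out)

      override def createLogic = new RunnableGraphStreamletLogic() {
        def runnableGraph =
          Source
            .tick(1.second, 1.second, ())            // emit one element per second
            .map(_ => SensorData("device-1", 42.0))  // hypothetical test record
            .to(plainSink(out))                      // write the records to the outlet
      }
    }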

Processor

A Processor has one inlet and one outlet. Processors represent common data transformations like map and filter, or any combination of them.

Figure 2. Processor
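
A map-style processor could be sketched roughly as follows, again with hypothetical Avro-generated classes SensorData and Metric:

    import cloudflow.akkastream._
    import cloudflow.akkastream.scaladsl._
    import cloudflow.streamlets._
    import cloudflow.streamlets.avro._

    object SensorDataToMetrics extends AkkaStreamlet {
      val in    = AvroInlet[SensorData]("in")                     // hypothetical classes
      val out   = AvroOutlet[Metric]("out", _.deviceId.toString)
      val shape = StreamletShape(in, out)

      override def createLogic = new RunnableGraphStreamletLogic() {
        def runnableGraph =
          sourceWithCommittableContext(in)                    // read records together with their offsets
            .map(data => Metric(data.deviceId, data.value))   // hypothetical transformation
            .to(committableSink(out))                         // write downstream and commit offsets
      }
    }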

FanOut

FanOut-shaped streamlets have a single inlet and two or more outlets.

Figure 3. FanOut

FanIn

FanIn-shaped streamlets have a single outlet and two or more inlets.

Figure 4. FanIn
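
The shape of a simple two-inlet merge could be declared like this, with SensorData again standing in for a hypothetical Avro-generated class:

    import cloudflow.streamlets._
    import cloudflow.streamlets.avro._

    // SensorData is a hypothetical Avro-generated class.
    val in0 = AvroInlet[SensorData]("in-0")
    val in1 = AvroInlet[SensorData]("in-1")
    val out = AvroOutlet[SensorData]("out", _.deviceId.toString)

    // Two inlets merged into a single outlet.
    val shape = StreamletShape.withInlets(in0, in1).withOutlets(out)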

Egress

An Egress represents data leaving the Cloudflow application. For instance this could be data being persisted to some database, notifications being sent to Slack, files being written to HDFS, etc.

Figure 5. Egress
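
A trivial egress that only logs incoming records could be sketched as follows; a real egress would typically write to an external system instead. Metric is again a hypothetical Avro-generated class:

    import akka.stream.scaladsl.Sink
    import cloudflow.akkastream._
    import cloudflow.akkastream.scaladsl._
    import cloudflow.streamlets._
    import cloudflow.streamlets.avro._

    object MetricLogger extends AkkaStreamlet {
      val in    = AvroInlet[Metric]("in")         // hypothetical class
      val shape = StreamletShape.withInlets(in)   // one inlet, zero outlets

      override def createLogic = new RunnableGraphStreamletLogic() {
        def runnableGraph =
          plainSource(in).to(Sink.foreach(metric => println(metric)))
      }
    }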

Schemas

Inlets and outlets of streamlets are explicitly typed, i.e. they only handle data that conforms to specific Avro schemas. Cloudflow takes a schema-first approach:

  • Avro schemas need to be defined for every inlet and outlet of the streamlet. The Cloudflow sbt plugin generates Java or Scala classes from the Avro schemas.

  • The generated types are used when defining the inlets and outlets of a streamlet.

  • Inlets and outlets can only be connected when their schemas are compatible.

For more information on schema compatibility and resolution, see Avro Schema Resolution.

Data safety

As mentioned earlier, streamlets are deployed by the Cloudflow operator as self-contained, isolated services. Data that flows between different streamlets will be serialized and persisted. Data flowing between inlets and outlets is guaranteed, within some constraints, to be delivered at least once, with a minimum of potential duplicated records during live restarts or upgrades. More details about data safety can be found in the development guides for the different runtimes.

Designing for performance

Persisting data between streamlets adds overhead, and it is important to keep this in mind when designing your applications. Several individual data processing operations can be combined into a single streamlet or spread out over multiple streamlets. Which design is best depends on your use case, for instance on whether the intermediate data produced has any value of its own. Intermediate data from an outlet can be consumed by multiple downstream streamlets instead of just one; each of them receives all of the data.

For example, let’s look at the following blueprint:

Figure 6. Original Blueprint

Data transformation and validation are done in two separate steps. If the output of the initial transformation is not useful, e.g. if no other streamlets will consume that stream, then the map and validate streamlets can be combined into a single streamlet to save the overhead of (de)serialization and persistence:

Figure 7. Updated Blueprint

For Lightbend Platform subscribers, Lightbend Console provides a way to observe and validate your application’s behavior. See Monitoring Cloudflow applications for more information.