Each streamlet has a unique canonical class name and metadata properties such as a description and a set of labels. A streamlet can have inlets and outlets. A streamlet should have at least one inlet or outlet. Each inlet is identified by the following:
- a codec indicating the type of data it accepts; Cloudflow currently supports Avro as the codec
- the class of the data it accepts, used as the type parameter when defining the inlet
- a name, specified as part of the inlet definition
Similarly, all outlets have the same attributes as inlets. In addition, every outlet must also specify a partitioning function as part of its definition. This function determines how data is partitioned when it is written to Kafka downstream. Partitioning the data across Kafka topic partitions is what allows processing to scale out.
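Conceptually, a partitioning function maps each record to a partitioning key, and Kafka hashes that key onto one of the topic's partitions. The following is a minimal sketch in plain Scala, independent of the Cloudflow API; `SensorData` is a hypothetical stand-in for an Avro-generated class:

```scala
// SensorData is a hypothetical stand-in for an Avro-generated class.
final case class SensorData(deviceId: String, temperature: Double)

// A partitioning function maps each record to a partitioning key.
// Records with the same key end up on the same Kafka partition,
// which preserves per-key ordering while spreading load.
val partitioner: SensorData => String = data => data.deviceId

// Kafka then hashes the key onto one of the topic's partitions,
// conceptually similar to:
def partitionFor(key: String, numPartitions: Int): Int =
  math.abs(key.hashCode % numPartitions)
```

Keying by device ID, for example, guarantees that all readings from one device are processed in order on the same partition.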
Each streamlet in the application blueprint is added by giving it a unique "reference" name, which identifies it. Multiple instances of the same streamlet can be added to the blueprint as long as they have different reference names.
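For illustration, a hypothetical blueprint fragment (the streamlet class and reference names are made up) could declare two instances of the same streamlet class under different reference names:

```hocon
blueprint {
  streamlets {
    // Two instances of the same streamlet class,
    // distinguished by their reference names.
    logger1 = com.example.RecordLogger
    logger2 = com.example.RecordLogger
  }
}
```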
Two streamlets can be connected by connecting the inlet of one streamlet to a compatible outlet of another. Compatibility of inlets and outlets is based purely on the compatibility of their associated schemas: an inlet with schema A can only be connected to an outlet whose schema is compatible with schema A. This compatibility is verified by the Cloudflow sbt plugin.
Each outlet can have multiple downstream inlets connected to it. Every inlet reads data from the outlet independently. An example of a blueprint with multiple inlets connected to the same outlet would look like this:
```hocon
... validation.out-1 = [ logger2.in, other-streamlet.in ] ...
```
For more details on how to develop, compose, configure, deploy, and operate streamlets in an application, see Developing Cloudflow streamlets.
Logic defines the business process that the streamlet is supposed to execute. For a streamlet that validates incoming streaming data and classifies it into valid and invalid records, the exact validation rules are the business process and must be defined as part of the streamlet logic.
Depending on the shape, we can think of various types of streamlets that conceptually map to elements of a streaming data pipeline. The following sections discuss a few of the most commonly used ones.
The inlets and outlets define the shape of a streamlet. A streamlet that’s supposed to accept streaming data of a specific type and split into valid and invalid records can have one inlet and two outlets (one for valid and the other for invalid records). Cloudflow offers APIs to build streamlets of various shapes.
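The valid/invalid split described above can be sketched in plain Scala, independent of the Cloudflow API; the record type and validation rule here are hypothetical:

```scala
// Hypothetical record type and validation rule, for illustration only.
final case class Measurement(deviceId: String, value: Double)

def isValid(m: Measurement): Boolean = m.value >= 0.0

// One logical inlet (the input sequence) and two logical outlets
// (the valid and invalid outputs): partition splits the stream.
def split(in: Seq[Measurement]): (Seq[Measurement], Seq[Measurement]) =
  in.partition(isValid)
```

In a real streamlet, the two halves of the split would be written to the two outlets rather than returned as a tuple.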
An Ingress is a streamlet with zero inlets and one or more outlets. An ingress could be a server handling requests, like the HttpIngress that comes standard with Cloudflow. It could also poll a back-end for data, or it could even be a simple test data generator.
A Processor has one inlet and one outlet. Processors represent common data transformations such as map and filter, or any combination of them.
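A processor's logic is conceptually a function from input records to output records. The following plain-Scala sketch (names and rules are made up) combines a map and a filter into a single one-in/one-out transformation:

```scala
// A processor's logic is conceptually a function from input records
// to output records; names and rules here are made up.
val celsiusToFahrenheit: Double => Double = c => c * 9.0 / 5.0 + 32.0
val plausible: Double => Boolean = f => f > -100.0 && f < 250.0

// A map and a filter combined into a single one-in/one-out transformation.
def process(readings: Seq[Double]): Seq[Double] =
  readings.map(celsiusToFahrenheit).filter(plausible)
```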
Inlets and outlets of streamlets are explicitly typed, i.e. they only handle data that conforms to specific Avro schemas. Cloudflow takes a schema-first approach:
- Avro schemas must be defined for every inlet and outlet of the streamlet.
- The Cloudflow sbt plugin generates Java or Scala classes from the Avro schemas.
- The generated types are used when defining the inlets and outlets of a streamlet.
- Inlets and outlets can only be connected when their schemas are compatible.
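For example, a minimal Avro schema for a hypothetical SensorData record (all names illustrative) might look like this:

```json
{
  "type": "record",
  "name": "SensorData",
  "namespace": "com.example",
  "fields": [
    {"name": "deviceId", "type": "string"},
    {"name": "temperature", "type": "double"}
  ]
}
```

From a schema like this, the sbt plugin would generate a `com.example.SensorData` class to use as the type parameter of an inlet or outlet.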
For more information on schema compatibility and resolution, see Avro Schema Resolution.
As mentioned earlier, streamlets are deployed by the Cloudflow operator as self-contained, isolated services. Data that flows between different streamlets is serialized and persisted. Data flowing between inlets and outlets is guaranteed, within some constraints, to be delivered at least once, with a minimum of potentially duplicated records during live restarts or upgrades. More details about data safety can be found in the development guides for the different runtimes.
Of course, data persistence between streamlets adds overhead, and it is important to take this into account when designing your applications. Several individual data-processing operations can be combined into a single streamlet or spread out over multiple streamlets. Which design is best depends on your use case, for instance on whether the intermediate data produced has any value of its own. Intermediate data from an outlet can be consumed by multiple downstream streamlets instead of just one; each consuming streamlet receives all of the data.
For example, let’s look at the following blueprint:
Data transformation and validation are done in two separate steps. If the output of the initial transformation is not useful on its own, e.g. if no other streamlets will consume that stream, then the transformation and validation can be combined into a single streamlet to save the overhead of (de)serialization and persistence:
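The two designs can be contrasted with hypothetical blueprint connection fragments (all streamlet and outlet names here are made up):

```hocon
// Separate streamlets: intermediate data is serialized
// and persisted to Kafka between the two steps.
connections {
  transform.out = [validate.in]
  validate.valid = [logger.in]
}

// Combined streamlet: transformation and validation are fused,
// so no intermediate topic is needed.
connections {
  transform-and-validate.valid = [logger.in]
}
```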
For Lightbend Platform subscribers, Lightbend Console provides a way to observe and validate your application’s behavior. See Monitoring Cloudflow applications for more information.