Planning for streaming data and analysis

The sections "Factors to consider for cluster planning" and "Sizing recommendations" provide information that also applies to streaming data applications. In addition, for streaming data, be sure to allocate enough of the following resources:

  • CPU

    A compute-intensive analytics application, like machine learning training or scoring, will often require significant CPU resources. In general, recent Spark performance research has shown that CPU bottlenecks are more common than I/O bottlenecks.

    This will be true, in general, for non-trivial analytics with any tool, although less so for "simple" ETL-type processing. For deep learning applications, GPUs are usually recommended.

  • Memory

    Spark jobs often have enormous memory requirements, for example when they perform large join operations or cache large datasets in memory for efficient access. These caches are an example of state maintained in the running process. Simpler jobs that keep little state require far less memory, such as ETL jobs that parse records and filter out bad ones, but neither perform aggregations nor require large, in-memory reference data.

  • Network I/O

    Network bandwidth is rarely a significant issue in modern streaming applications, because modern cluster networks are very fast and bottlenecks usually occur elsewhere: cluster environment overhead, virtualization, CPU load, or the overhead of moving data through Kafka (due to persistence to disk). In fact, the old Hadoop mantra of colocating compute and storage is largely obsolete these days.

  • Disk I/O

    As discussed for Kafka requirements, it is best to use SSDs wherever performance-critical disk I/O is required, such as for Kafka and smaller databases. Tools like Spark that use file systems for caching and checkpointing state may also require faster disk I/O if they do so frequently. These cached data sets can be small or quite large, so ensure that the underlying file system copes well with many small files (a known issue for HDFS, for example), as well as with files that could be as large as tens of gigabytes for very large Spark jobs.
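As a rough illustration of the Disk I/O point above, the following Python sketch estimates the sustained write rate that checkpointing would demand. It rests on a deliberately pessimistic assumption that is not taken from any particular tool: the entire state is rewritten at every checkpoint. The function name, the 20 GB state size, and the intervals are all hypothetical, so treat the output as an upper-bound estimate to refine with real measurements.

```python
def checkpoint_write_mb_per_s(state_size_mb: float, interval_s: float) -> float:
    """Sustained write rate (MB/s) if the full state is written once per interval.

    Assumption: every checkpoint rewrites the whole state. Tools that write
    only incremental changes will need less than this.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return state_size_mb / interval_s


# Hypothetical workload: 20 GB of cached or checkpointed state.
state_mb = 20 * 1024

for interval_s in (10, 60, 600):
    rate = checkpoint_write_mb_per_s(state_mb, interval_s)
    print(f"checkpoint every {interval_s:>3d} s: about {rate:.1f} MB/s of sustained writes")
```

Under these assumptions, shortening the checkpoint interval raises the required write rate proportionally, which is why frequent checkpointing of large state tends to favor faster disks.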

Ultimately, if you have demanding performance requirements, whether tight SLAs or a need to keep costs tightly constrained, you will have to profile your applications (or reasonable approximations of the planned applications) to determine the optimal configuration.
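A first profiling pass can be as simple as comparing CPU time with elapsed time for a representative piece of work. The Python sketch below is a minimal, single-process illustration of that idea, not a substitute for the metrics your cluster tools report. The two workload functions are hypothetical stand-ins: one for compute-heavy scoring and one for a step that mostly waits on I/O.

```python
import time


def cpu_fraction(work, *args):
    """Run work(*args) once and return (wall_seconds, cpu_seconds, cpu/wall)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu, (cpu / wall if wall > 0 else 0.0)


def score_records(n):
    """Hypothetical stand-in for compute-heavy analytics."""
    total = 0.0
    for i in range(n):
        total += (i % 7) ** 0.5
    return total


def wait_for_input(seconds):
    """Hypothetical stand-in for a step that blocks on disk or network I/O."""
    time.sleep(seconds)


for name, work, arg in (("scoring", score_records, 500_000),
                        ("waiting", wait_for_input, 0.2)):
    wall, cpu, frac = cpu_fraction(work, arg)
    print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s cpu/wall={frac:.2f}")
```

A ratio near 1 suggests the step is CPU-bound, so more or faster cores are likely to help. A ratio near 0 suggests the step spends most of its time waiting, which points toward disk, network, or upstream systems. Note that `time.process_time()` counts CPU time for the whole process, so multithreaded work can report a ratio above 1.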