Kafka Clusters

Kafka brokers are the core server processes in a Kafka cluster. The ideal number of Kafka broker nodes depends on the number of concurrent producers and consumers, and the rates at which they write and read data, respectively. More brokers mean greater resiliency in the case of node failures. As use of Kafka grows, the number of Kafka brokers should increase accordingly.

The following table outlines general Kafka broker recommendations. Later sections describe how to calculate specific needs for your environment.

Table 1. Kafka Broker Resource Recommendations

  Resource   Amount
  Nodes      5 or more
  Processor  4-16+ cores
  Memory     32-64 GB
  Hard Disk  1+ TB (SSDs recommended)

Brokers are not CPU-bound under most production workloads. When TLS is used within the cluster and for client connections, CPU usage will trend higher to support the required cryptographic operations.

Kafka Broker Disks

Because Kafka persists messages to log segment files, excellent disk I/O performance is a requirement. Writing Kafka logs to multiple disks improves performance through parallelism. Because Kafka requires that an entire partition fit on a single disk, disk size places an upper bound on the amount of data that can be stored for a given topic partition. Hence, the number of topics, the number of partitions per topic, the size of the records, and so on all affect the log retention limits. If you have a high-volume partition, allocate it to a dedicated drive. For resilience, also be sure to allocate enough disk space cluster-wide for partition replicas.
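For example, spreading the log directories over multiple drives is done with the broker's log.dirs property, a comma-separated list of directories. The mount points below are hypothetical; the point is that each partition lives entirely within one of the listed directories:

    # server.properties: one log directory per physical disk (paths are hypothetical).
    # Each topic partition is stored entirely within one of these directories.
    log.dirs=/mnt/disk1/kafka-logs,/mnt/disk2/kafka-logs,/mnt/disk3/kafka-logs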

Our own testing of Kafka with SSD drives shows that write performance of around 1.2 GB/sec, plus or minus 400 MB/sec, is readily achievable, staying above 1.0 GB/sec most of the time. These results are similar to those reported by the Sangrenel project for Kafka performance testing.

Obviously, such rates can fill up any drive quickly, causing potential data loss and other problems. We’ve observed that broker disk quota settings appear to be ignored. Hence, this mechanism doesn’t prevent Kafka from filling a drive. Based on the anticipated data rate, you can calculate how much disk space may be required and how to trade off the following variables to meet your needs and keep Kafka healthy:

  • Disk size - Multi-TB hard drives offer more capacity than SSDs, but at the cost of read and write performance.

  • Number of partitions - Using more partitions increases throughput per topic through parallelism. It also allows more drives to be used, one per partition, which increases capacity. Note that Kafka only preserves message order per partition, not over a whole, multi-partition topic. Topic messages can be partitioned randomly or by hashing on a key (see the sketch after this list).

  • Replication factor - The replication factor provides resiliency for a partition. Replicas are spread across available brokers, so the replication factor is always <= the number of brokers in your cluster. It must be accounted for when determining disk requirements, because a leader partition and a replica partition use the same amount of space. The total number of partition replicas for a given topic is the number of partitions * the replication factor.

  • Dedicated drives - Using dedicated drives for the Kafka brokers not only improves performance by removing other activity from the drive, but also ensures that if the drive becomes full, other services that need drive space are not adversely affected.

  • RAID - Do not use RAID for fault tolerance. Kafka provides fault-tolerance guarantees at the application layer (via the topic replication factor), making fault tolerance at the hard disk level redundant.
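Because ordering is only guaranteed within a partition, the partitioning strategy matters. The following minimal Python sketch illustrates the idea behind key-hash partitioning; it is a simplification for illustration only (Kafka's actual default partitioner in the Java client uses a murmur2 hash of the key). The point is that messages sharing a key always land in the same partition, so per-key ordering is preserved even though there is no total order across the topic.

    import hashlib
    import random

    def choose_partition(key, num_partitions):
        """Pick a partition for a message.

        Keyed messages are hashed so the same key always maps to the same
        partition, preserving per-key ordering; keyless messages are spread
        randomly. This mirrors the idea, not Kafka's exact algorithm.
        """
        if key is None:
            return random.randrange(num_partitions)
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions

    # All events for the same account id go to the same partition, so their
    # relative order is preserved for consumers of that partition.
    for account_id in ["acct-1", "acct-2", "acct-1"]:
        print(account_id, "->", choose_partition(account_id, num_partitions=3))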

Kafka does not degrade gracefully if it runs out of disk space. Make sure your sizing estimates are generous, monitor disk utilization, and take corrective action well before disk exhaustion occurs! In a default deployment Kafka can consume all the disk space available on the disks it accesses, which can cause not only Kafka to fail, but also the other applications and services on the node.
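As a minimal starting point for that monitoring, a sketch like the following could be run periodically (for example, from cron) to warn before a broker's log directory fills up. The log directory path and the 80% threshold are assumptions to adapt to your deployment; most teams will instead feed this signal into their existing monitoring system.

    import shutil
    import sys

    # Hypothetical values: adjust to your broker's log.dirs and alerting policy.
    KAFKA_LOG_DIR = "/var/lib/kafka/data"
    WARN_THRESHOLD = 0.80  # warn at 80% utilization

    def check_disk(path=KAFKA_LOG_DIR, threshold=WARN_THRESHOLD):
        usage = shutil.disk_usage(path)      # total, used, free (bytes)
        utilization = usage.used / usage.total
        if utilization >= threshold:
            print(f"WARNING: {path} is {utilization:.0%} full "
                  f"({usage.free / 1e9:.1f} GB free)", file=sys.stderr)
            return 1
        print(f"OK: {path} is {utilization:.0%} full")
        return 0

    if __name__ == "__main__":
        sys.exit(check_disk())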

To estimate disk requirements per broker, begin by estimating the projected usage. Here is an example:

Table 2. Kafka Broker Disk Considerations

  Quantity                                 Example                               Discussion
  Average topic message throughput (MT)    1000 msg/sec                          For each topic, how many messages on average per unit time?
  Average message size (MS)                10 KB                                 Kafka works best with messages under 1 MB in size.
  Broker message retention length (RL)     1 week (168 hours or 604800 seconds)  This could also be a size.
  Number of topics (NT)                    1
  Average number of topic partitions (NP)  3                                     Or sum each topic's number of partitions for a more specific calculation.
  Average replication factor (NR)          3                                     Or do a more specific calculation.
  Number of Kafka Brokers (KB)             9

With these variables, the estimated disk size per broker is calculated as follows:

Disk per broker = ( (MT / NP) * MS * RL * NT * NP * NR ) / KB

Which can be simplified to this:

Disk per broker (simplified) = ( MT * MS * RL * NT * NR ) / KB

Using our example numbers:

( 1000 msg/s * 10 KB * 1 week * 1 topic * 3 replicas ) / 9 brokers
( 10 MB/s * 604800 seconds * 3 replicas ) / 9 brokers
18.14 TB / 9 brokers = ~2 TB per broker.
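The same estimate can be expressed as a small Python sketch. The variable names mirror Table 2, the values are the example values (replace them with your own projections), and decimal units (1 KB = 1000 bytes) are used as in the worked example above:

    # Estimated disk per broker, using the variables from Table 2.
    MT = 1000            # average topic message throughput (msg/sec)
    MS = 10 * 1000       # average message size (bytes): 10 KB
    RL = 7 * 24 * 3600   # retention length (seconds): 1 week = 604800 s
    NT = 1               # number of topics
    NR = 3               # average replication factor
    KB = 9               # number of Kafka brokers

    disk_per_broker = (MT * MS * RL * NT * NR) / KB
    print(f"~{disk_per_broker / 1e12:.2f} TB per broker")  # ~2.02 TB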

This calculation is just an estimate. It assumes partitions are the same size and evenly distributed across all brokers. While Kafka attempts to assign partitions evenly at topic creation time, some partitions will almost certainly end up larger than others if a non-trivial partitioning strategy is used. In that case you will likely need to rebalance partitions at some point, that is, redistribute partitions evenly across the available brokers. It's important to pad your estimate to account for this reality, as well as to accommodate future growth.

Topic granularity and retention

The following trade-offs also impact the amount of disk space you will need:

  • Topic granularity - Consider using fine-grained topics, rather than more partitions, when each client needs to see all messages in a topic to track the evolving state encapsulated in those messages. This is a balance between high-volume workloads, where more partitions offer an advantage, and application message bus workloads, where topics may describe very specific entities within the domain and fine topic granularity is ideal. However, keep in mind that many applications only require strict ordering by key (for example, the activity for customer accounts, where the account id is the key). If the topic is partitioned by hashing on the key, then this requirement is met even without total ordering over the entire topic.

  • Retention policy - The longer the retention policy, the more disk space required. For high-volume topics, consider moving the data to downstream, longer-term storage as quickly as possible and using a more aggressive retention policy. An alternative to fixed time-based or size-based retention is compaction. Compaction only works with keyed messages. A message produced with the same key as a previous message in the partition "updates" the value associated with that key. Consumers only see the latest version of the message when they consume the partition (see the sketch after this list). The log cleaner process "compacts" the log on a regular basis and as a result reclaims disk space. Compaction has compelling use cases but is not suitable in situations where you want a clean, immutable event log. See the Apache Kafka documentation on Compaction for more details.
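To make the compaction semantics concrete, the following minimal Python sketch models a compacted partition as a mapping from key to latest value. It illustrates only the retention behavior consumers observe, not how the broker's log cleaner is actually implemented:

    # Compaction semantics: only the latest value per key is retained.
    def compact(messages):
        """messages: ordered (key, value) pairs, in produce order."""
        latest = {}
        for key, value in messages:
            latest[key] = value        # a later message "updates" an earlier one
        return latest

    produced = [
        ("acct-1", {"balance": 100}),
        ("acct-2", {"balance": 50}),
        ("acct-1", {"balance": 75}),   # supersedes the first acct-1 record
    ]

    # A consumer reading the compacted partition from the beginning sees only
    # the latest state for each key.
    print(compact(produced))
    # {'acct-1': {'balance': 75}, 'acct-2': {'balance': 50}}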

Let’s discuss estimating retention time. If you already have servers with fixed disks, you could modify the formula and choose to solve for a maximum time-based message retention length.

Starting with the previous approximate value for the disk per broker (DB): 2 TB (rounding down…), the maximum retention time is as follows:

Max retention time = ( DB * KB ) / ( (MT / NP) * MS * NT * NP * NR )

Which can be simplified to this:

Max retention time (simplified) = ( DB * KB ) / ( MT * MS * NT * NR )

Using our example numbers:

(2 TB * 9 brokers) / ( 1000 msg/s * 10 KB * 1 topic * 3 replicas)
18 TB / ( 10 MB/s * 1 topic * 3 replicas)
18 TB / 30 MB/s = 600000 seconds = ~167 hours = ~1 week
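Expressed as a Python sketch, again with the example values and decimal units, and with DB as the available disk per broker:

    # Maximum time-based retention given a fixed disk budget per broker.
    DB = 2 * 1e12        # disk per broker (bytes): 2 TB
    KB = 9               # number of Kafka brokers
    MT = 1000            # average topic message throughput (msg/sec)
    MS = 10 * 1000       # average message size (bytes): 10 KB
    NT = 1               # number of topics
    NR = 3               # average replication factor

    max_retention_s = (DB * KB) / (MT * MS * NT * NR)
    print(f"{max_retention_s:.0f} seconds = ~{max_retention_s / 3600:.0f} hours")
    # 600000 seconds = ~167 hours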

Kafka Broker Memory

Server memory is the most important resource for a broker that must support a high-volume, low-latency workload. The broker runs as a JVM process and does not require much system memory itself. Lightbend Platform’s default broker configuration allocates a maximum of 5GB of JVM heap space. However, ample memory is ideal for the broker because the broker uses the operating system’s Page Cache to store recently used data.

Page Cache (aka Disk Cache) is memory allocated by the operating system to store pages of data that were recently read from or written to disk. Brokers perform an operation called a zero-copy transfer to efficiently store recently produced messages in the Page Cache so they can be read quickly by low-latency consumers. Zero-copy is what gives Kafka its low-latency characteristics. Kafka messages are also stored in a standardized binary message format that is shared between the producer, broker, and consumer, which lets data be transferred to each participant without modification. What all this means is that under normal workloads, produced messages are written once to the Page Cache (real memory) and made available to consumers immediately, without any mutation, serialization, byte-copying, or other overhead costs.

The efficient use of the Page Cache allows brokers to supply downstream consumers with more data than the network can carry, making network throughput the limiting performance factor. Writing the Page Cache to disk is performed by the operating system, not the broker. If Page Cache flushes do not occur frequently enough, there is a risk that the Page Cache will fill up and, as a result, block writes until pages are flushed to disk, which creates a serious write-latency problem. Make sure your operating system is configured to flush the Page Cache to disk promptly and efficiently. See the Broker operating system configuration section below for guidance on configuring your operating system’s behavior when writing the Page Cache to disk.

To support the efficient use of the Page Cache, it’s important to allocate a respectable amount of memory to broker nodes over and above what is allocated to the broker JVM process heap space. For most production workloads, consider starting with the minimum of the memory recommendation in Table 1 above, 32GB.

See the Apache Kafka documentation’s Persistence, Efficiency, and Understanding Linux OS Flush Behaviour sections for more information about how brokers utilize disk, zero-copy, the Page Cache, and system memory.

Changing broker JVM heap space

The Apache Kafka distribution uses a maximum of 1GB of heap space out of the box, which should be more than enough for local development and testing purposes. For production, it’s recommended to start with 5GB and tune that number based on monitoring of its utilization. If there’s not enough heap space for the broker, you will observe JVM Out Of Memory exceptions in the broker logs and the broker will fail. The most likely causes of running out of memory are an above-average number of partitions or a larger-than-normal message size. The more partitions you have, the more messages can be produced to or consumed from Kafka in parallel. A larger maximum message size (defined by the broker property message.max.bytes, default 1MB) results in larger message representations in the broker JVM.

You can define the amount of heap space assigned to each broker at package install time using the YAML configuration file.

Broker operating system configuration

Kafka relies on operating system configuration related to virtual memory and files to function efficiently. On Linux operating systems these settings are applied with sysctl and made permanent by persisting them in /etc/sysctl.conf. Lightbend recommends the following OS configuration (a consolidated /etc/sysctl.conf example follows the list):

  • Virtual Memory Swapping

    Virtual memory swapping should be tuned to occur rarely. Physical memory is cheap enough today that swapping memory to disk is usually unnecessary, and it incurs significant latencies, so it should be avoided. However, most operating systems still enable swapping with a default setting that is not ideal for Kafka brokers. Virtual memory swapping on Linux is configured with vm.swappiness, which usually defaults to 60. The higher this number, the more aggressively the system swaps to disk to conserve memory. It’s recommended to set vm.swappiness to 1, which instructs the system to do the minimum amount of swapping, as a last-resort effort to avoid an Out Of Memory condition. If you observe your broker swapping to disk regularly, it is a sign that you should evaluate this setting and related virtual memory settings, and consider adding more brokers or more physical memory to existing broker nodes.

  • Page Flushing Configuration, i.e., Writing Page Cache to Disk

    The importance of the Page Cache in Kafka was described in the Kafka Broker Memory section above. Due to the broker’s heavy reliance on the Page Cache, it’s important to optimize how frequently the system writes "dirty" pages (i.e., pages updated in memory but not yet written to disk). The vm.dirty_background_ratio setting is the percentage of memory that can contain dirty pages before the system triggers asynchronous operations to persist them to disk. Most Linux systems set this to 10% (10), which is reasonable for most workloads, but it is wise to use 5 or lower for Kafka broker nodes. The higher this value, the greater the risk of losing unpersisted data in the event of a failure, so your message durability requirements will help guide how low to set it.

    The vm.dirty_ratio setting is crucial for low latency workloads. This setting determines the maximum percentage of memory occupied by dirty pages before the system is forced to synchronously write pages to disk. When the system reaches this limit, all subsequent writes to the Page Cache will be blocked, which has obvious implications for the latency of your messages. Most Linux systems generally set this to 20% by default, which is too low for most production workloads. Consider setting this to 50% (50) as a starting point.

  • File Descriptor Limits

    By default, many Linux operating systems configure a conservative number of allowed open file descriptors. This setting controls how many file references the operating system can keep active at any given time. Brokers make efficient use of the file system, but require a large number of open file descriptors to accommodate references to the log segment files, log index files, and log time index files for all topic partitions that the broker hosts. There are few downsides to increasing this limit. It’s recommended to define a limit of 100000 as a starting point and to monitor broker and system logs in case the limit is reached. The system-wide limit is configured with fs.file-max; also check the broker process’s per-process limit (the nofile ulimit), which often defaults to a much lower value than the broker needs.

    Use the following to calculate roughly how many open file handles your broker may have. Be sure to round up this number to account for system processes, socket connections (each of which requires a descriptor, too), and any other tasks running on the same node.

    (Average Number of Partitions) * (Partition Size/Log Segment File Size)
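    For example, using purely hypothetical numbers (3000 partitions hosted on a broker, an average partition size of 25 GB, and the default 1 GB log segment size), the estimate is:

        3000 partitions * (25 GB / 1 GB per segment) = 75000 open log segment files

    Each segment also has associated index and time index files, so round this estimate up accordingly.

Pulling the recommendations in this section together, the corresponding /etc/sysctl.conf entries could look like the following. The values are the starting points suggested above; tune them to your own latency and durability requirements:

    # Starting-point values from this section; apply with `sysctl -p`
    vm.swappiness = 1                # swap only as a last resort
    vm.dirty_background_ratio = 5    # begin asynchronous flushing of dirty pages sooner
    vm.dirty_ratio = 50              # allow more dirty pages before writes block
    fs.file-max = 100000             # raise the open file descriptor limit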

Consult the Apache Kafka documentation’s Hardware and OS section for more details. To learn more about tuning virtual memory configuration in Linux, consult the Wikipedia article on Swappiness and read these posts on The Lone Sysadmin blog: Better Linux Disk Caching & Performance with vm.dirty_ratio & vm.dirty_background_ratio, and Adjust vm.swappiness to Avoid Unneeded Disk I/O.