Factors to consider for cluster planning

Planning a cluster requires taking a number of factors into account, such as application-specific requirements and existing infrastructure. Network performance is critical to the health of the cluster. For the best results, use the fastest available network.

You will likely need multiple clusters to satisfy requirements at different phases of the lifecycle. In a large organization, managing multiple clusters with different requirements can become difficult. Lightbend recommends using Ansible Playbooks to automate installation and reduce this burden.

The GitHub openshift-ansible repo contains Ansible roles and playbooks to install, upgrade, and manage OpenShift clusters. Installing and managing a multi-node cluster requires running Ansible from a bootstrap node. This node can also serve as a bastion host for private network access.
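
If you script the installation, it might look like the following minimal sketch, which drives the upstream playbooks from the bootstrap node. The repository checkout, inventory path, and playbook names are assumptions that should be checked against the openshift-ansible release you use.

    # Sketch: run the openshift-ansible prerequisites and deploy playbooks from
    # the bootstrap node. The checkout and inventory paths below are hypothetical;
    # adjust them for your environment and openshift-ansible release.
    import subprocess

    OPENSHIFT_ANSIBLE = "/home/ec2-user/openshift-ansible"  # hypothetical checkout location
    INVENTORY = "/home/ec2-user/inventory/hosts"            # hypothetical inventory file

    def run_playbook(playbook: str) -> None:
        """Run a single playbook and fail fast on a non-zero exit code."""
        subprocess.run(
            ["ansible-playbook", "-i", INVENTORY, f"{OPENSHIFT_ANSIBLE}/{playbook}"],
            check=True,
        )

    # The prerequisites playbook prepares the hosts; deploy_cluster performs the install.
    run_playbook("playbooks/prerequisites.yml")
    run_playbook("playbooks/deploy_cluster.yml")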

Development, testing, and staging environments

Development environments require build services, such as Jenkins, to build custom applications. Building applications requires additional resources for compilation, testing, and storing build artifacts.

Developers might use a non-redundant cluster for multi-node integration and end-to-end testing before releasing to the pre-production testing environment. Development clusters do not need to simulate the production environment exactly; their purpose is to increase the team’s confidence in its ability to successfully make changes to the production environment. Development clusters may also be short-lived, for example when validating a new cluster version. In contrast, staging and final inspection need to run in a pre-production environment that mimics production as closely as possible, usually a production-grade cluster that is not yet configured to accept end-user requests.

Production environments

Lightbend recommends testing in a production-grade environment as soon as possible during the development process. Production deployment requires numerous services and resources such as a Docker Registry, DNS services, and version control. If your organization already has suitable solutions, using them reduces the resources required by the cluster. Alternatively, if you must provision supporting services, your cluster will require additional resources, from CPU and storage to Ingress and service deployments. These resources must be accounted for over and above those required to provision and maintain the cluster itself.

For tight access control, consider using private clusters, as described in the Google Cloud documentation on private clusters. This setup typically uses a front-end load balancer, such as a node running HAProxy, in front of the cluster master nodes. On AWS this might be an ELB instance. This proxy load balancer is the target for the commands you use to administer the cluster and submit applications. For single-master test clusters, a VPN into the cluster’s private network can be used to target the single master directly. You will also need public proxy nodes and/or load balancers for the services you intend to make available.
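
As a minimal illustration that the load balancer is the administrative target, the following sketch probes the API health endpoint through it. The hostname and port are hypothetical placeholders for your front-end load balancer address.

    # Sketch: confirm that the cluster API is reachable through the front-end
    # load balancer. The hostname and port below are hypothetical placeholders.
    import requests

    API_ENDPOINT = "https://openshift-master-lb.example.com:8443"  # hypothetical LB address

    # verify=False only because a private test cluster often uses self-signed
    # certificates; use the cluster CA bundle in real environments.
    response = requests.get(f"{API_ENDPOINT}/healthz", verify=False, timeout=5)
    print(response.status_code, response.text)  # expect 200 and "ok"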

Making application services available to users requires additional Ingress resources. Nodes running HAProxy or load balancers are used to proxy Ingress traffic. Infrastructure nodes, possibly with public IP addresses, can be used to segment user services from infrastructure services. X.509 certificates are required for secure TLS connections. Clusters with public endpoints also require public IP addresses and DNS records. Subdomains can be used to offload DNS record management, which is particularly useful when Ingress routes are created and removed frequently.
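
To make the Ingress and TLS requirements concrete, here is a minimal sketch using the Kubernetes Python client. The host, Secret, and service names are hypothetical, it assumes a networking.k8s.io/v1-capable cluster, and on OpenShift you would typically expose applications with Routes instead.

    # Sketch: an Ingress that terminates TLS for a hypothetical service, using
    # a certificate stored in a pre-created TLS Secret.
    from kubernetes import client, config

    config.load_kube_config()
    networking = client.NetworkingV1Api()

    ingress = client.V1Ingress(
        metadata=client.V1ObjectMeta(name="my-app"),
        spec=client.V1IngressSpec(
            # "app.example.com" and the "my-app-tls" Secret are placeholders.
            tls=[client.V1IngressTLS(hosts=["app.example.com"], secret_name="my-app-tls")],
            rules=[client.V1IngressRule(
                host="app.example.com",
                http=client.V1HTTPIngressRuleValue(paths=[
                    client.V1HTTPIngressPath(
                        path="/",
                        path_type="Prefix",
                        backend=client.V1IngressBackend(
                            service=client.V1IngressServiceBackend(
                                name="my-app",
                                port=client.V1ServiceBackendPort(number=80),
                            )
                        ),
                    )
                ]),
            )],
        ),
    )

    networking.create_namespaced_ingress(namespace="default", body=ingress)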

The availability requirements of the deployed applications also significantly impact sizing. Highly-available clusters require redundancy, for example, at least three Kafka brokers. Further resiliency can be achieved by replicating services across data centers and across cloud vendor availability zones. Highly available systems may require multiple data centers, each hosting multiple production clusters. In contrast, in development clusters where data loss is acceptable, single instances of critical services can be sufficient.
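
For example, Kafka redundancy is expressed as a topic replication factor, which requires at least as many brokers as replicas. The sketch below uses the kafka-python admin client; the bootstrap address and topic name are placeholders.

    # Sketch: create a topic whose data is replicated across three brokers, so
    # the loss of a single broker does not lose data. Requires >= 3 brokers.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="kafka:9092")  # placeholder address

    topic = NewTopic(
        name="events",             # placeholder topic name
        num_partitions=6,
        replication_factor=3,      # needs at least 3 brokers in the cluster
        topic_configs={"min.insync.replicas": "2"},  # keep accepting writes if one broker is down
    )
    admin.create_topics([topic])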

Don’t overlook the fact that cluster monitoring and log storage also require sufficient resources. Time-series storage is required for logging and metric data, such as the data produced by Lightbend Telemetry (a.k.a. the Cinnamon library), which is part of Lightbend Platform. Lightbend Console uses Prometheus for data storage and Grafana for dashboards and services. It also ingests OpenShift cluster metrics.
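
As an illustration of the Prometheus-backed storage, the following sketch queries the Prometheus HTTP API directly. The Prometheus URL is a placeholder that depends on how the instance is exposed in your cluster.

    # Sketch: query cluster metrics from Prometheus over its HTTP API.
    # The URL below is a placeholder; use the address at which your Prometheus
    # instance is exposed (for example via a Route or a port-forward).
    import requests

    PROMETHEUS = "http://prometheus.example.com:9090"  # placeholder

    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query",
        params={"query": "up"},  # which scrape targets are currently up
        timeout=5,
    )
    for result in resp.json()["data"]["result"]:
        print(result["metric"].get("job"), result["value"][1])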

Lightbend strongly recommends that tools for metrics and log aggregation be installed in the cluster or available as external services. They should provide appropriate query and presentation capabilities, essential for isolating and debugging problems.

If you need a log management solution, Lightbend recommends Humio. It can store events locally in the cluster or in the Humio cloud, with live dashboards that provide very flexible, powerful, and fast query capabilities. Contact Humio for OpenShift-specific installation instructions, as they differ from the Kubernetes installation. Alternatively, the EFK stack uses Fluentd to aggregate event logs into Elasticsearch.

Persistent and transient storage

Storage is a critical aspect of planning your cluster. As a general rule, clusters require multiple types and classes of storage for both applications and services. The OpenShift storage examples are informative. OpenShift can also provide ephemeral local storage for pods. See also the OpenShift Persistent Storage and Configuring Docker Storage documentation.

To inform storage decisions, gather estimates on the following (a rough sizing calculation in code follows the list):

  • The quantity of application data that will require long-term storage

  • The amount of transient data, such as messages in Kafka topics between processing services, that will pass through the system

  • The amount of space required for Docker images and other build artifacts

  • The amount of space and length of time required to store logs and metrics
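
The rough calculation below, referenced above, shows one way to turn such estimates into approximate storage numbers. All of the input figures are hypothetical and should be replaced with your own measurements.

    # Sketch: rough storage sizing from hypothetical estimates. Replace the
    # inputs with measurements from your own applications and cluster.
    GIB = 1024 ** 3
    MIB = 1024 ** 2

    # Transient data flowing through Kafka topics.
    kafka_mib_per_second = 5        # average bytes produced across all topics
    kafka_retention_hours = 72      # how long messages are retained
    kafka_replication_factor = 3    # copies kept across brokers
    kafka_bytes = (kafka_mib_per_second * MIB
                   * kafka_retention_hours * 3600
                   * kafka_replication_factor)

    # Metric samples stored in a time-series database such as Prometheus.
    samples_per_second = 20_000     # ingested samples across the whole cluster
    bytes_per_sample = 2            # conservative on-disk cost per sample
    metrics_retention_days = 15
    metrics_bytes = (samples_per_second * bytes_per_sample
                     * metrics_retention_days * 86_400)

    # Aggregated logs from all pods.
    log_gib_per_day = 10
    log_retention_days = 30
    log_bytes = log_gib_per_day * GIB * log_retention_days

    print(f"Kafka topics:   {kafka_bytes / GIB:,.0f} GiB")
    print(f"Metrics (TSDB): {metrics_bytes / GIB:,.0f} GiB")
    print(f"Logs:           {log_bytes / GIB:,.0f} GiB")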

For example, choosing the most appropriate kinds of storage for stateful services like Kafka helps to size them properly. As with other resources, if the cluster can utilize a SAN or another centralized storage solution, particularly one with fast I/O, less storage must be provided by the cluster itself. The Hadoop Distributed File System (HDFS) is a commonly used off-cluster storage location.
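
For instance, a stateful service typically requests storage through a PersistentVolumeClaim bound to a storage class that matches its I/O profile. The following sketch uses the Kubernetes Python client; the storage class name and size are placeholders.

    # Sketch: request 100 GiB of fast storage for a stateful service such as a
    # Kafka broker. "fast-ssd" is a placeholder storage class name; use whatever
    # classes your cluster administrator has defined.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="kafka-broker-0-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="fast-ssd",
            resources=client.V1ResourceRequirements(requests={"storage": "100Gi"}),
        ),
    )
    core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)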

Setting up and using chart values describes how to set up Persistent Volumes for Lightbend Console.

Security

The Kubernetes documentation contains a section on securing a cluster, which is a good introduction. The resource site Kubernetes Security provides further detail on the subject. Finally, the Kubernetes Security and Disclosure Information site describes how Kubernetes vulnerabilities are reported and disclosed.

For OpenShift, the OpenShift Container Security Guide provides guidelines as you set up your clusters. You should also follow the OpenShift reference architectures, where applicable. These architectures are maintained by the OpenShift architecture team and document best practices.

Lightbend recommends:

  • Putting all clusters inside a private network in order to better control public access to the cluster. Administrator access to the master control plane can be provided using an external load balancer.

  • Encrypting the cluster secret storage mechanism. By default, secrets, such as passwords and API keys, are only base64 encoded in etcd, not encrypted (see the sketch after this list). Third-party stores, such as Vault and Conjur, are designed for secure secret storage.

  • Using VPNs to provide cluster access without exposing any publicly-routable IP addresses, as discussed next.
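
The sketch referenced in the list above shows why encryption at rest matters: anyone who can read a Secret object, or the etcd data behind it, can trivially reverse the base64 encoding. The secret name and namespace are placeholders.

    # Sketch: base64 is an encoding, not encryption. Anyone able to read the
    # Secret (or the underlying etcd data) can recover the plaintext values.
    # "db-credentials" and "default" are placeholder names.
    import base64
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    secret = core.read_namespaced_secret(name="db-credentials", namespace="default")
    for key, encoded in (secret.data or {}).items():
        print(key, "=", base64.b64decode(encoded).decode("utf-8"))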

VPN Usage

Using a VPN to access the cluster remotely provides convenient access for users and administrators, while also providing extra security by eliminating the need for cluster nodes to have publicly-routable addresses.

Use of Lightbend Platform, Kubernetes-based tools, and implementation of many development and deployment scenarios is much easier when you have relatively unfettered access to the cluster nodes. This is especially true when your VPN DNS configuration permits domain name resolution from your workstations using the DNS services running in the cluster.
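
As a quick check of that DNS behavior, the following sketch resolves a cluster-internal service name from a VPN-connected workstation. The service name is a placeholder, and resolution only succeeds if the VPN pushes the cluster’s DNS servers to the client.

    # Sketch: from a VPN-connected workstation, resolve a cluster-internal
    # service name. This only works if the VPN configuration pushes the cluster
    # DNS servers; the service name below is a placeholder.
    import socket

    name = "my-service.my-namespace.svc.cluster.local"
    try:
        print(name, "->", socket.gethostbyname(name))
    except socket.gaierror as err:
        print(f"Could not resolve {name}: {err} (is cluster DNS reachable over the VPN?)")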

VPN Usage in AWS

Configuring VPNs will depend in part on your on-premises infrastructure and on your cloud provider, if applicable. As an example, here are high-level instructions for using a VPN with an AWS-based OpenShift cluster, following the approach the Lightbend Platform team uses for its own development purposes; a scripted sketch of the VPC portion follows the list. Contact Lightbend for more detailed assistance.

  • Create a VPC, e.g., 10.1.0.0/16

  • Use a public subnet, e.g., 10.1.0.0/24

  • Use a private subnet within the VPC CIDR, e.g., 10.1.1.0/24

  • Attach an Internet gateway to the VPC

  • Create a NAT Gateway and place it in the public subnet

  • Define security group rules on the public subnet:

    • Allow inbound traffic from 0.0.0.0/0 to ports 22 (SSH) and 1194 (OpenVPN)

    • Full, unrestricted traffic flow within the subnet

    • Allow all traffic to and from the private subnet

  • Define security group rules on the private subnet:

    • Full, unrestricted traffic flow within the subnet

    • Allow all traffic to and from the public subnet

  • Create an OpenVPN access server. For example, select a community or AWS Marketplace AMI preconfigured with OpenVPN. You’ll need to edit /etc/openvpn/server.conf to reflect the particulars of your network setup and restart via systemctl restart openvpn@server. Also, comment out the DNS settings in the server.conf file.

  • Install a suitable VPN client on your workstation, such as Tunnelblick.

    • Configure a client connection for the VPN.

    • Verify that your VPN is working; can you access the cluster?
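
The following sketch, referenced above, scripts the VPC portion of these steps with boto3. The CIDR blocks match the example values above, the region and resource names are placeholders, and the security group rules are simplified to the SSH and OpenVPN ports plus open traffic inside the VPC; route tables, the OpenVPN server, and client configuration still need to be completed separately.

    # Sketch: the VPC/subnet/gateway portion of the steps above, using boto3.
    # Region, names, and CIDRs are examples only.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    # VPC with one public and one private subnet.
    vpc_id = ec2.create_vpc(CidrBlock="10.1.0.0/16")["Vpc"]["VpcId"]
    public_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.1.0.0/24")["Subnet"]["SubnetId"]
    private_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.1.1.0/24")["Subnet"]["SubnetId"]

    # Internet gateway for the public subnet and a NAT gateway for private egress.
    igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
    eip = ec2.allocate_address(Domain="vpc")
    ec2.create_nat_gateway(SubnetId=public_id, AllocationId=eip["AllocationId"])

    # Security group admitting SSH (22) and OpenVPN (1194) from anywhere;
    # traffic within the VPC is left fully open here for simplicity.
    sg_id = ec2.create_security_group(
        GroupName="vpn-public", Description="SSH and OpenVPN access", VpcId=vpc_id
    )["GroupId"]
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
            {"IpProtocol": "udp", "FromPort": 1194, "ToPort": 1194,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
            {"IpProtocol": "-1",
             "IpRanges": [{"CidrIp": "10.1.0.0/16"}]},
        ],
    )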

Additional planning resources

OpenShift provides the following helpful resources on the planning process; the general principles should apply to any Kubernetes-based platform:

The next page, Sizing recommendations, offers information on sizing physical resource requirements.