Many adopters of Kubernetes try to carry over all of the infrastructure automation technology that predates cloud native applications and Kubernetes, such as Puppet, Chef, Ansible, Packer, and Terraform. Using these tools in a non-cloud native manner does not yield the best result. For example, using a configuration management tool to build container images adds unnecessary time and complexity to your application deployments, which can cost you agility and your ability to recover.
Kubernetes reliability becomes much easier to achieve with the right configurations. The degree to which you use pre-Kubernetes tools, and the manner in which you use them, will likely change as you adopt containers, Kubernetes, and cloud native architecture.
Reliability in a Kubernetes environment is synonymous with stability, streamlined development and operations, and a better user experience. It is also easy to configure things incorrectly in Kubernetes, so getting the configuration right matters. Here I suggest four Kubernetes best practice tips for increased reliability. You can read my complete recommendations here.
Use cloud native architecture to help embrace the ephemeral nature of containers and Kubernetes pods (a running instance of your application container). Two examples include:
Kubernetes helps improve reliability by providing redundant components and by making it possible to schedule application containers across multiple nodes and multiple availability zones (AZs) in the cloud. Use anti-affinity or node selection to help spread your applications across the Kubernetes cluster for high availability.
Node selection allows you to define which nodes in your cluster are eligible to run your application based on labels. The labels typically represent node characteristics like bandwidth or special resources like GPUs.
Anti-affinity further constrains where your application is allowed to run, based on the labels of pods that are already running on a node. You can use it to keep replicas of your application off the same node, or to keep them off a node that already runs other components of the same application. Read some more fault tolerance advice here.
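To make node selection and anti-affinity more concrete, here is a minimal sketch that builds a Deployment manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The application name, image, `app` label, and `disktype: ssd` node label are hypothetical placeholders, not values from this article. The field names follow the Kubernetes pod spec.

```python
import json


def build_deployment(name, image, replicas, node_labels):
    """Build a Deployment manifest that spreads replicas across nodes and zones.

    All names and label values passed in are illustrative assumptions.
    """
    pod_labels = {"app": name}
    selector = {"matchLabels": dict(pod_labels)}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": selector,
            "template": {
                "metadata": {"labels": dict(pod_labels)},
                "spec": {
                    # Node selection: only nodes carrying these labels are eligible.
                    "nodeSelector": dict(node_labels),
                    "affinity": {
                        "podAntiAffinity": {
                            # Hard rule: never place two replicas on the same node.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": selector,
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ],
                            # Soft rule: prefer different availability zones.
                            "preferredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "weight": 100,
                                    "podAffinityTerm": {
                                        "labelSelector": selector,
                                        "topologyKey": "topology.kubernetes.io/zone",
                                    },
                                }
                            ],
                        }
                    },
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_deployment(
        "web", "example.com/web:1.0", replicas=3, node_labels={"disktype": "ssd"}
    )
    print(json.dumps(manifest, indent=2))
```

Keep in mind that a required anti-affinity rule is strict: if you ask for more replicas than there are eligible nodes, the extra pods will stay unscheduled, so a preferred rule may be the safer choice on small clusters.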
Resource requests and limits for CPU and memory are at the heart of what allows the Kubernetes scheduler to do its job. If a single pod is allowed to consume all of a node's CPU and memory, then other pods, and potentially Kubernetes components, will be starved of resources. Setting limits on a pod's consumption increases reliability by preventing the “noisy neighbor problem”, in which one pod consumes all of the available resources on a node.
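As a rough illustration of requests and limits, the sketch below builds a container spec with both set, plus a small helper that reports which settings a container is missing. The quantities (`250m`, `256Mi`, and so on) are arbitrary example values rather than recommendations, and `missing_resource_settings` is a hypothetical helper, not part of any Kubernetes tooling.

```python
import json


def container_with_resources(name, image):
    """Return a container spec with CPU and memory requests and limits.

    The quantities are placeholder values; size them from your own measurements.
    """
    return {
        "name": name,
        "image": image,
        "resources": {
            # Requests: what the scheduler reserves when it places the pod.
            "requests": {"cpu": "250m", "memory": "256Mi"},
            # Limits: the ceiling the container may not exceed at runtime.
            "limits": {"cpu": "500m", "memory": "512Mi"},
        },
    }


def missing_resource_settings(container):
    """List the 'section.resource' settings that a container spec does not define."""
    resources = container.get("resources", {})
    missing = []
    for section in ("requests", "limits"):
        for resource in ("cpu", "memory"):
            if resource not in resources.get(section, {}):
                missing.append(f"{section}.{resource}")
    return missing


if __name__ == "__main__":
    container = container_with_resources("web", "example.com/web:1.0")
    print(json.dumps(container, indent=2))
    print(missing_resource_settings({"name": "bare", "image": "example.com/bare:1.0"}))
```

A check like this could run in a CI pipeline to catch containers that would otherwise be deployed without any bounds on their consumption.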
By default, Kubernetes will begin sending traffic to application containers immediately. Increase the robustness of your application by setting health checks (readiness and liveness probes) that tell Kubernetes when your application pods are ready to receive traffic or when they have become unresponsive. See my advice for setting these probes.
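Here is one way the health checks could look, sketched as a container spec with an HTTP readiness probe and an HTTP liveness probe. The `/ready` and `/healthz` paths, the port, and the timing values are assumptions for illustration; your application has to serve whatever endpoints you configure.

```python
import json


def http_probe(path, port, initial_delay_seconds, period_seconds, failure_threshold):
    """Build an HTTP GET probe using the standard Kubernetes probe fields."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay_seconds,
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def container_with_probes(name, image, port):
    """Return a container spec with readiness and liveness probes.

    Endpoint paths and timings are hypothetical example values.
    """
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Readiness: the pod receives traffic only while this check passes.
        "readinessProbe": http_probe("/ready", port, 5, 10, 3),
        # Liveness: the container is restarted if this check keeps failing.
        "livenessProbe": http_probe("/healthz", port, 15, 20, 3),
    }


if __name__ == "__main__":
    print(json.dumps(container_with_probes("web", "example.com/web:1.0", 8080), indent=2))
```

Giving the liveness probe a longer initial delay than the readiness probe, as in this sketch, is one way to avoid restarting a container that is simply slow to start.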
If you want a more in-depth analysis of building reliable clusters, check out our Kubernetes Best Practices. You can also read more about best practices for security and efficiency.