Kubernetes, initially released in 2014, is an open source container orchestration system released under Apache License 2.0 and written in Go. Google originally created it, but today it's maintained by the Cloud Native Computing Foundation (CNCF). Kubernetes offers incredible flexibility, allowing organizations to deploy, scale, and manage production-grade, containerized workloads easily. That flexibility comes with great complexity and many unfamiliar terms and technologies for teams to understand. Liveness probes and readiness probes are two terms you should know as you deploy mature applications to Kubernetes. In this guide, we will discuss running a liveness probe in Kubernetes clusters as a type of health check to determine if a container is still running and responsive.
You want your applications to be reliable, but sometimes... they’re not. They may have failed due to configuration errors, application errors, or temporary connection issues. Although the reason the application became unreliable is important, it’s equally important to know that an issue has occurred or is occurring. Probes can help developers with troubleshooting by monitoring their applications for issues, but they can also help them to plan and manage resources by indicating when an application is experiencing resource contention.
Probes are periodic checks that monitor the health of an application. They are typically configured using the command-line client or a YAML deployment template, and developers can use either method. The three types of probes in K8s are:
1. Startup probes: These probes check whether the application inside a container has started. If a startup probe fails, the kubelet kills the container and it is restarted according to the restartPolicy for the pod. You can configure startup probes in the spec.containers.startupProbe attribute for the pod configuration. A primary motivation for startup probes is that some legacy applications require additional startup time when first initialized, which can make setting liveness probe parameters tricky. When configuring a startupProbe, use the same protocol as the application and ensure that failureThreshold * periodSeconds is enough to cover the worst-case startup time.
2. Readiness probes: These probes determine whether a container is ready to receive traffic. You can configure readiness probes in the spec.containers.readinessProbe attribute for the pod configuration. These probes run periodically as defined by the periodSeconds attribute.
3. Liveness probes: These probes check whether a container is still running and responsive. You can configure liveness probes in the spec.containers.livenessProbe attribute of the pod configuration. Like readiness probes, liveness probes also run periodically. We will look at their details and configuration options below.
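For reference, here is a minimal sketch of how startup and readiness probes might be declared on a hypothetical web container. The pod name, image, port, and endpoint paths are illustrative assumptions, not values from the example later in this article:

apiVersion: v1
kind: Pod
metadata:
  name: probes-sketch              # hypothetical pod name for illustration
spec:
  containers:
  - name: web
    image: example.com/web:1.0     # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:                  # gives a slow-starting app time to initialize
      httpGet:
        path: /healthz             # assumed health endpoint
        port: 8080
      failureThreshold: 30         # failureThreshold * periodSeconds = 300s worst-case startup
      periodSeconds: 10
    readinessProbe:                # gates traffic until the app reports ready
      httpGet:
        path: /ready               # assumed readiness endpoint
        port: 8080
      periodSeconds: 5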
By design, Kubernetes automatically monitors pods throughout their lifecycle, restarting them when it detects failures on Process ID 1 (PID 1), the init process responsible for starting up and shutting down the system. That works well when your application crashes, because the failed process exits with a non-zero exit code that Kubernetes can detect. Unfortunately, not all applications are the same, and Kubernetes doesn't always detect failures. For example, if your application loses its database connection, or encounters timeouts when connecting to a third-party service, it might not recover on its own. In cases like this, the pod appears to the kubelet to be running as expected, but end users are unable to access the application.
These types of issues can be difficult to debug because at the container level everything is operating as expected. Liveness probes solve this problem because they communicate information about the internal states of your pods to Kubernetes, which means that your cluster will handle the problem instead of requiring manual monitoring and intervention. Liveness probes reduce your maintenance burden and make certain that your application is not silently failing.
Below are details on the available types of liveness probes in Kubernetes. Selecting the type of liveness probe that most closely aligns with your application’s architecture and accurately exposes the internal state of your application is critical for successful workloads deployed to Kubernetes.
1. Command execution liveness probe: This probe runs a command or script inside the container. If the command terminates with 0 as its exit code, it means the container is running as expected.
2. HTTP GET liveness probe: This probe sends an HTTP GET request to a URL in the container. If the container’s response includes an HTTP status code in the 200-399 range, it means the probe was successful.
You can set these additional fields on httpGet for your HTTP probe: host, scheme, path, httpHeaders, and port.
3. TCP Socket liveness probe: This probe attempts to connect to a specific TCP port inside the container. If the specified port is open, the probe is considered successful.
4. gRPC liveness probe: Applications that use gRPC can use gRPC health-check probes. This type of probe has been available since Kubernetes v1.23. If your application implements the gRPC Health Checking Protocol, you can configure the kubelet to use it for application liveness checks. In v1.23 you need to enable the GRPCContainerProbe feature gate to configure checks that rely on gRPC; the feature is enabled by default from v1.24 onward. You must configure the port to use a gRPC probe, and if the health endpoint is configured on a non-default service, you also need to specify the service. A brief configuration sketch for these probe types follows.
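Here is a minimal sketch of how the HTTP GET, TCP socket, and gRPC variants might look in a pod spec; the pod name, images, ports, path, and header are illustrative assumptions rather than values from a real application:

apiVersion: v1
kind: Pod
metadata:
  name: probe-variants-sketch      # hypothetical pod name for illustration
spec:
  containers:
  - name: web
    image: example.com/web:1.0     # placeholder image
    livenessProbe:                 # HTTP GET probe
      httpGet:
        path: /healthz             # assumed health endpoint
        port: 8080
        httpHeaders:
        - name: X-Probe-Source     # optional custom header
          value: kubelet
      initialDelaySeconds: 5
      periodSeconds: 10
  - name: cache
    image: example.com/cache:1.0   # placeholder image
    livenessProbe:                 # TCP socket probe
      tcpSocket:
        port: 6379
      periodSeconds: 10
  - name: api
    image: example.com/api:1.0     # placeholder image
    livenessProbe:                 # gRPC probe (Kubernetes v1.23+)
      grpc:
        port: 9090                 # the port is required for gRPC probes
      periodSeconds: 10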
In production environments, including liveness probes as a part of your application deployment templates is considered a best practice. This way you can template and reuse your liveness probe configuration across similar applications.
When getting started, it's best to test and demonstrate your liveness probe configuration in an application similar to the one you plan to run in production. To illustrate this below, we'll take an example image from registry.k8s.io/busybox and deploy a pod with a liveness probe using the command execution method.
Applying this YAML in your cluster deploys an example pod whose health check succeeds for the first 40 seconds and then intentionally enters a failed state, at which point the liveness probe fails and the kubelet restarts the container to restore service.
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-example
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthz; sleep 40; rm -f /tmp/healthz; sleep 700
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthz
      initialDelaySeconds: 6
      periodSeconds: 6
In the example above, the Pod has a single container. The fields and commands under the livenessProbe attribute specify how you want the kubelet to perform the health checks:
- initialDelaySeconds: 6 tells the kubelet to wait six seconds before performing the first probe.
- periodSeconds: 6 tells the kubelet to perform a liveness probe every six seconds.
- failureThreshold (3 by default) is the number of consecutive failed probes after which the kubelet restarts the container; the kubelet honors the terminationGracePeriodSeconds for that container as part of that threshold.
- terminationGracePeriodSeconds configures how long the kubelet waits between triggering a shutdown of the failed container and forcing the container runtime to stop it. If not specified, the default is 30 seconds and the minimum value is 1.
- To perform a probe, the kubelet executes the command cat /tmp/healthz in the target container. For the first 40 seconds of the life of the container, the command returns a success code. After that, it returns a failure code.
For HTTP and TCP probes, you can use a named port. An example is port: http.
Note that gRPC probes do not support named ports or custom hosts.
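For example, a container-spec fragment might name the port and then reference it from the probe; the port number and health path here are assumptions for illustration:

    ports:
    - name: http               # give the container port a name
      containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz         # assumed health endpoint
        port: http             # reference the port by its name instead of its number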
Successful liveness probes don't impact the health of your cluster. The probed container keeps running, and a new probe is scheduled after the periodSeconds delay. If a probe runs too frequently, though, it wastes resources and can hurt application performance. If your probes aren't frequent enough, on the other hand, your containers may run in an unhealthy state for extended periods before being addressed.
Use the fields and commands outlined above to fine-tune your probes to your application. Once you know how long your liveness probe's command, API request, or gRPC call takes to complete, you can use that value, plus a small buffer, to set timeoutSeconds. Use the smallest value you can for simple, short-running probes. Intensive or long-running commands may require you to wait longer between repetitions, which means you will not have the most up-to-date view of the health of your containers.
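As a rough sketch of what such tuning could look like (these values are illustrative assumptions, not recommendations for any particular workload):

    livenessProbe:
      httpGet:
        path: /healthz                    # assumed health endpoint
        port: 8080
      timeoutSeconds: 2                   # probe usually completes in ~1s; allow a small buffer
      periodSeconds: 10                   # frequent enough to catch failures without wasting resources
      failureThreshold: 3                 # tolerate brief blips before restarting
      terminationGracePeriodSeconds: 30   # probe-level grace period before force-stopping the container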
Also make sure that the target of the probe command or HTTP request is independent of your main application, so that it can report its status to the kubelet even if your primary application fails. If your liveness probe is served by your standard application entry point, it could return inaccurate results if the application framework fails or if it calls an external dependency that is unavailable.
Finally, set restartPolicy: Always (which is the default) or restartPolicy: OnFailure to ensure that Kubernetes can restart the containers after a failed liveness probe. If you use the Never policy, your container will remain in the failed state indefinitely after a liveness probe fails.
Using liveness probes in K8s can help you improve your application availability because they give you ongoing insight into the health of the applications inside your containers. In some ways, Kubernetes can create a disconnect, because while your pods may appear healthy, your users may not actually be able to access your apps and services. Liveness probes help you verify that your applications, containers, and pods are all running as designed and ensure that K8s restarts containers when they become unhealthy.
You can use open source tools, such as Polaris, to automatically audit your YAML manifests and flag any issues they find. In the case of liveness probes and readiness probes, Polaris may leave comments to prompt users to make changes appropriate to the context of their application. Here is a video of me walking through some of these basic examples of setting liveness probes across clusters to ensure reliability using Fairwinds Insights, which you can use to get started.
Different types of health checks can help you perform liveness checks and readiness checks in Kubernetes. Liveness probes, readiness probes, and startup probes can all help you make sure that your Kubernetes services are built on a good foundation so your DevOps teams can deliver better reliability and higher uptime of your apps and services. If you’re having trouble getting started, check out this tutorial on the Kubernetes website. Need help? Our Managed Kubernetes team has built quiet and secure infrastructure across a variety of frameworks, so you can focus on your business, not your infrastructure.
Originally published March 1, 2023, updated April 5, 2024.