Deploying your first Kubernetes cluster on a service like GKE or AKS is pretty straightforward, but when you want to actually use Kubernetes, add-ons become important.
Add-ons extend the functionality of Kubernetes; examples include Metrics Server, cert-manager, and Cluster Autoscaler. The challenge with add-ons, as with any software, is that they require upgrades.
Each time an add-on needs an upgrade, a number of checks must happen to ensure that the latest version is compatible with your cluster and introduces no breaking changes. This process can be extremely time-consuming, especially when you are managing more than a handful of clusters.
That’s why Fairwinds has released its latest open source tool, GoNoGo, a spec that can be used to define and then discover if an add-on that was installed with Helm is safe to upgrade.
Monitoring for add-on upgrades is part of managing clusters. In this example, we’ll talk through a cert-manager upgrade. An SRE will be aware of an upcoming or new cert-manager upgrade. The SRE will need to determine if the upgrade is needed by reviewing the documentation and release notes.
GoNoGo depends on users managing and upgrading add-ons with Helm. Once you’ve decided you want to do the upgrade, you’ll fill in a number of fields in a YAML file called a bundle spec. These fields set the parameters that are evaluated against cluster and Helm chart information to determine the upgrade confidence of that add-on.
For example:
The new add-on version may only run on Kubernetes 1.23 and above.
You may want to check that certain APIs are available in the cluster.
You may want to check that your manifests specify API versions that are appropriate for the version you're upgrading to.
All of this information will be bundled in a YAML file and fed into GoNoGo.
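To make the first two checks concrete, here is a minimal Python sketch of the kind of comparison involved. It is not GoNoGo's implementation or its bundle format; the function names, the cluster data, and the required values are all hypothetical and exist only to illustrate the idea.

```python
def parse_version(text):
    """Turn 'v1.23.4-gke.100' or '1.23' into a (major, minor) tuple of ints."""
    major, minor = text.lstrip("v").split(".")[:2]
    return (int(major), int(minor))


def check_addon(cluster_version, cluster_apis, min_k8s_version, required_apis):
    """Return a list of human-readable problems; an empty list means 'go'."""
    problems = []
    if parse_version(cluster_version) < parse_version(min_k8s_version):
        problems.append(
            f"cluster is {cluster_version}, upgrade needs {min_k8s_version}+"
        )
    for api in required_apis:
        if api not in cluster_apis:
            problems.append(f"API {api} is not available in the cluster")
    return problems


# Hypothetical cluster facts and upgrade requirements, for illustration only.
cluster_apis = {"apps/v1", "networking.k8s.io/v1", "cert-manager.io/v1"}
problems = check_addon(
    "v1.22.8", cluster_apis, "1.23", ["cert-manager.io/v1", "policy/v1beta1"]
)
for problem in problems:
    print(problem)
```

With these made-up inputs the sketch reports two problems: the cluster is older than 1.23, and one required API is missing. Either would be a reason to hold the upgrade.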
You’ll then run GoNoGo with a flag pointing to this bundle file. GoNoGo will compare the list of add-ons in your bundle spec to the Helm releases deployed in your cluster, and will run your specified checks against add-ons that have been successfully deployed to your cluster. GoNoGo will then use a combination of chart manifests, the upstream repo, and cluster manifests to run against the various checks in your bundle spec.
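The matching step can be pictured with a short Python sketch. The sample data is shaped like the output of `helm list --all-namespaces --output json`, but the releases themselves are made up. Matching on release name is an assumption for illustration, and `select_releases` is a hypothetical helper, not part of GoNoGo.

```python
import json

# Made-up sample shaped like `helm list --all-namespaces --output json`.
helm_list_output = """
[
  {"name": "cert-manager", "namespace": "cert-manager",
   "status": "deployed", "chart": "cert-manager-v1.5.0"},
  {"name": "metrics-server", "namespace": "kube-system",
   "status": "failed", "chart": "metrics-server-3.8.2"},
  {"name": "my-app", "namespace": "default",
   "status": "deployed", "chart": "my-app-0.1.0"}
]
"""


def select_releases(bundle_addons, releases):
    """Keep releases that are named in the bundle and successfully deployed."""
    wanted = set(bundle_addons)
    return [
        release
        for release in releases
        if release["name"] in wanted and release["status"] == "deployed"
    ]


# Hypothetical list of add-on names taken from a bundle spec.
bundle_addons = ["cert-manager", "metrics-server", "cluster-autoscaler"]
selected = select_releases(bundle_addons, json.loads(helm_list_output))
print([release["name"] for release in selected])
```

In this sample only cert-manager is selected: metrics-server is listed in the bundle but its release failed, cluster-autoscaler is not installed, and my-app is not an add-on named in the bundle.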
In your bundle you can also set customized OPA checks that will look for things like annotations and fields in a manifest. GoNoGo will run the OPA checks against the individual YAML files of the Helm chart, as well as any objects you’ve specified in the bundle under the “resources” key.
GoNoGo also runs JSON Schema validation. It can validate against an upstream values.schema.json file stored in a repo, or you can provide your own schema inline. For example, your schema could require that imagePullPolicy is set to Always.
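As a rough illustration of what such a schema check does, the Python sketch below hand-rolls a tiny validator. It understands only the `required`, `enum`, and nested `properties` keywords, so it is not a full JSON Schema implementation and not how GoNoGo validates. The schema and the values are hypothetical examples.

```python
def validate(values, schema, path="values"):
    """Check `required`, `enum`, and nested `properties` only."""
    errors = []
    for key in schema.get("required", []):
        if key not in values:
            errors.append(f"{path}.{key} is required")
    for key, subschema in schema.get("properties", {}).items():
        if key not in values:
            continue
        value = values[key]
        if "enum" in subschema and value not in subschema["enum"]:
            errors.append(
                f"{path}.{key} must be one of {subschema['enum']}, got {value!r}"
            )
        if isinstance(value, dict):
            # Recurse into nested objects such as `image`.
            errors.extend(validate(value, subschema, f"{path}.{key}"))
    return errors


# Hypothetical inline schema: the pull policy must be "Always".
schema = {
    "type": "object",
    "required": ["image"],
    "properties": {
        "image": {
            "type": "object",
            "required": ["pullPolicy"],
            "properties": {"pullPolicy": {"enum": ["Always"]}},
        }
    },
}

# Hypothetical Helm values for a release.
values = {"image": {"repository": "example.com/addon", "pullPolicy": "IfNotPresent"}}
errors = validate(values, schema)
for error in errors:
    print(error)
```

Here the values fail validation because the pull policy is `IfNotPresent` rather than `Always`. In practice a complete JSON Schema validator would be the better choice, since it covers the keywords this sketch ignores.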
While you will still need to monitor for add-on upgrades, review documentation, and check in the changes to your spec, GoNoGo saves you the time of manually reviewing all objects in your cluster.
For example, if cert-manager were deprecating an annotation, you would need to look through the upstream Helm chart you're going to pull from and check whether its manifests have been updated. You would also need to go into your cluster, run helm get manifest, and look through all of your manifests individually to make sure those also have the correct annotations. GoNoGo does this for you. Because the check is automated, there is less risk of missing a file or making some other human error during an upgrade.
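The manual review described here amounts to scanning every manifest for the annotation in question. A minimal Python sketch of that scan follows, assuming the manifests have already been parsed into dictionaries. The annotation key, the manifests, and the `find_annotation` helper are hypothetical. GoNoGo itself expresses checks like this as OPA policies rather than Python.

```python
def find_annotation(manifests, annotation):
    """Return 'Kind/name' for every manifest that carries the annotation."""
    hits = []
    for manifest in manifests:
        metadata = manifest.get("metadata") or {}
        annotations = metadata.get("annotations") or {}
        if annotation in annotations:
            hits.append(f"{manifest.get('kind')}/{metadata.get('name')}")
    return hits


# Hypothetical parsed manifests, e.g. from a chart or a deployed release.
manifests = [
    {
        "kind": "Ingress",
        "metadata": {
            "name": "web",
            "annotations": {"example.com/old-annotation": "true"},
        },
    },
    {"kind": "Service", "metadata": {"name": "web"}},
    {"kind": "Deployment", "metadata": {"name": "api", "annotations": None}},
]

hits = find_annotation(manifests, "example.com/old-annotation")
print(hits)
```

This sketch looks only at top-level metadata. A thorough check would also need to cover nested metadata, such as the pod template inside a Deployment.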
The biggest risk in an add-on upgrade is that, because add-ons are used cluster-wide, a failure could have far-reaching consequences for your cluster. The scope of impact will depend on the add-on. For Metrics Server, it’s probably not a big deal. But if you misconfigure an nginx ingress add-on by not specifying an ingress class name properly, it might not pass traffic, causing unnecessary downtime. Similarly, if you have a broken cert-manager installation, there's a window where you can't generate new certs for workloads while you're trying to figure that out. So while the cluster might not be down, service could still be impacted.
If you are upgrading add-ons, check out GoNoGo.