In container orchestration, Kubernetes has become the go-to solution for deploying and managing containerized applications at scale. Ten years after Kubernetes was made publicly available, however, the complexity of deploying it across diverse environments still requires tooling to help maintain efficiency, security, and reliability. Integrating Infrastructure as Code (IaC) tools, such as Argo CD, Terraform, and Vault, can help site reliability engineers (SREs) make Kubernetes management more automated, secure, and scalable.
At Fairwinds, we run Kubernetes clusters for other organizations — essentially providing Managed Kubernetes-as-a-Service. We run our clients’ Kubernetes infrastructure inside their cloud, and we build it so that it’s portable and our customers can keep using it even if we’re no longer managing Kubernetes for them. This creates some interesting challenges, because we are managing Kubernetes infrastructure across three cloud providers’ managed Kubernetes services (Amazon Elastic Kubernetes Service (EKS), Microsoft Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE)) and dozens of different cloud accounts that aren’t owned by the same people. Understandably, there’s no central governance over those accounts. To make it easier for us to manage that infrastructure and build automation around it, we brought together a whole bunch of different cloud native open source technologies into our management workflow.
To do all of this effectively, we had several requirements around how we organize, automate, and secure each client’s infrastructure.
Our inventories are folders in the IaC repository that roughly map to cloud accounts or individual clusters, and within each inventory we have Terraform and Kubernetes manifests. To run the Terraform, we use Atlantis, an application for automating Terraform via pull requests. We deploy it into our infrastructure as a standalone application, and no third party has access to our credentials. We run multiple Atlantis instances to segment the cloud credentials used to run the Terraform, the execution environments, and the permissions in Atlantis. Each instance has access to only the credentials and code required for that customer’s cloud accounts or clusters. The Terraform code manages the Virtual Private Cloud (VPC), the Identity and Access Management (IAM) roles, and the other things we need underneath our Kubernetes cluster, as well as the cluster itself. Once we have a Kubernetes cluster (regardless of whether it’s an EKS, GKE, or AKS cluster), we need to install things on it.
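As a rough sketch of how pull requests map to inventories (the project names and directory layout below are hypothetical, not our actual repo structure), a repo-level atlantis.yaml can scope each Atlantis project to one inventory folder so plans and applies only run against that directory:

```yaml
# atlantis.yaml -- repo-level Atlantis config (illustrative names and paths)
version: 3
projects:
  # One project per inventory folder; each Atlantis instance only serves
  # the inventories for the cloud accounts it holds credentials for.
  - name: client-a-prod-eks
    dir: inventories/client-a-prod/terraform
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
  - name: client-a-staging-gke
    dir: inventories/client-a-staging/terraform
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
```

A plan is triggered automatically when a PR touches files in a project’s directory, and the apply runs with only that instance’s credentials.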
Now that our cluster is running, we need add-ons, such as ExternalDNS, a metrics server, and so on. In each cluster, we install Argo CD, which enables us to apply manifests from that infrastructure as code repository. The challenge here is secrets management: we needed to be able to get secrets into Argo CD and Atlantis to automate authentication with each of the different cloud providers. At Fairwinds, we chose Vault, which brokers and integrates with trusted identities, allowing us to automate access to secrets, data, and systems. It has the credentials needed to assume roles and hand out assumed roles to both Atlantis and our SREs for each cloud environment.
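To show what “apply manifests from the IaC repository” looks like in practice (the repo URL, path, and namespace below are placeholders, not our actual layout), an Argo CD Application can point a cluster at its inventory folder and keep the add-ons synced:

```yaml
# Argo CD Application that syncs an inventory folder from the IaC repo.
# repoURL, path, and namespace are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-addons
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/iac-repo.git
    targetRevision: main
    path: inventories/client-a-prod/manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from the repo
      selfHeal: true   # revert drift back to the declared state
    syncOptions:
      - CreateNamespace=true
```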
In Vault, we can store any secrets that need to be installed in the client clusters, such as application programming interface (API) keys. This enables us to automate many things by using the External Secrets Operator, which pulls secrets out of Vault. This is great because we don’t have to keep secrets in our code repository, while still being able to apply our manifests using Argo CD.
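As a minimal sketch of how the External Secrets Operator bridges Vault and a cluster (the Vault address, auth role, and secret paths are assumptions for illustration), a SecretStore points at Vault and an ExternalSecret materializes a regular Kubernetes Secret from it:

```yaml
# SecretStore wired to Vault via Kubernetes auth (illustrative values).
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: external-dns
spec:
  provider:
    vault:
      server: https://vault.example.com
      path: secret            # KV mount
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets
          serviceAccountRef:
            name: external-secrets
---
# ExternalSecret that pulls an API key out of Vault into a Kubernetes Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: external-dns-api-key
  namespace: external-dns
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: external-dns-api-key
    creationPolicy: Owner
  data:
    - secretKey: api-key
      remoteRef:
        key: clusters/client-a-prod/external-dns   # placeholder Vault path
        property: api-key
```

The manifests above are safe to keep in the GitOps repo because they reference the secret’s location, not its value.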
When an SRE needs to make changes to the infrastructure, they do it in the IaC repo — in the Terraform. On top of that, we have a system of checks for pull request (PR) reviews and similar tasks to make sure that, while it’s easy to make the changes our clients’ infrastructures need, nothing happens without a plan and review process.
Most of what SREs do is via pull requests, which allows us to restrict their day-to-day roles to limited access. The internal tooling and automation do the heavy lifting with privileged accounts. However, sometimes we still need the ability to get in and make changes manually in an emergency. In those cases, SREs can escalate their role and run privileged commands from their command line. When they do, it’s logged in Vault, and we trigger a notification to Slack asking them for justification. This balance allows us to stay flexible while following the principle of least privilege at all times.
We created a centralized templating engine (we call it Terrafish) that spits out all (or most) of our Terraform, and we also use it to generate manifests (via Helm) for the GitOps repo we use to manage every client’s infrastructure. And of course, sometimes we need to build on top of that templating for individual client needs; to do that, we go into an infrastructure code repository and make a PR against it. Together, these tools enable us to securely build, manage, and maintain hundreds of Kubernetes clusters for our many clients.
To watch how we do this and walk through a demo of everything in action, check out our Cloud Native Live webinar. In the webinar, we show you:
If you need help getting a production-grade Kubernetes cluster up and running securely and efficiently, reach out. We deliver white-glove Managed Kubernetes-as-a-Service that’s fast, secure, and stable, so you can focus on your business.