When I started working with Kubernetes, I thought it’d be a fairly simple transition. I’d been working in AWS for years, and on Linux for years before that. I knew about building containers, provisioning VMs, networking, managing memory and CPU - everything that goes into running workloads in the cloud. At Google, I’d even worked with Kubernetes’ predecessor, Borg. How hard could it be to learn another framework for computing?
As it turns out, very hard. Despite having now worked for over a year at a company that specializes in Kubernetes Enablement, I’m still learning on a daily basis. Kubernetes isn’t just a new syntax - it’s a whole new way of thinking about distributed computation.
The hardest thing, by far, is the sheer depth of the ecosystem. The core project has over 2 million lines of code, and comes with reams of documentation. Add to that dozens of SIGs and hundreds of third-party add-ons, both open source and proprietary, and it’s easy to get overwhelmed.
This complexity makes Kubernetes an incredibly robust and flexible platform for managing compute, and is why it has been adopted by over half the Fortune 500. But with so many ways to set up a cluster, mistakes are all but guaranteed. It takes a lot of knowledge and experience to keep Kubernetes clusters secure, efficient, and reliable.
The community has built some amazing open source tools to audit Kubernetes infrastructure, which have the potential to drastically mitigate these problems. But the auditing ecosystem is complex, fractured, and ever-changing. So we built Fairwinds Insights as a single abstraction layer on top of a suite of open source auditing tools. Fairwinds Insights helps organizations easily understand what they’re doing right, and where they need to focus their attention.
I’ve been lucky enough to learn about Kubernetes in the presence of veterans. We’ve got a team of amazing SREs who have managed hundreds of clusters for dozens of organizations. They’ve seen Kubernetes go wrong just about every way it can, and I’ve had the great fortune of being able to learn from their experience rather than my own mistakes (though there have been plenty of those too!).
Even before I arrived, they’d begun to put together an open source dashboard, called Polaris, that scans Kubernetes workloads for common configuration mistakes. It checks for issues ranging from security (like running containers as root), to efficiency (like not requesting a specific amount of memory/CPU), to reliability (like failing to set liveness and readiness probes). Without a project like Polaris to tell me what I’d missed, I would have needed a lot more hand-holding when deploying workloads into Kubernetes.
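To make those categories concrete, here’s a rough illustration (not Polaris’s actual implementation, just a sketch using the upstream Kubernetes API types) of the kind of container spec these checks flag:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A container spec with the mistakes described above: no security
	// context, no resource requests, and no health probes.
	risky := corev1.Container{
		Name:  "web",
		Image: "example/web:latest", // hypothetical image
	}

	if risky.SecurityContext == nil || risky.SecurityContext.RunAsNonRoot == nil {
		fmt.Println("security: container may run as root")
	}
	if risky.Resources.Requests == nil {
		fmt.Println("efficiency: no CPU/memory requests set")
	}
	if risky.LivenessProbe == nil || risky.ReadinessProbe == nil {
		fmt.Println("reliability: liveness/readiness probes are missing")
	}
}
```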
The team also pointed out that the open source community has built dozens of similar tools. There’s trivy for container scanning, kube-bench for checking your alignment with the CIS Benchmark, and our own Goldilocks for right-sizing CPU and memory settings. We ended up cataloging over 30 open source security tools alone!
The sheer volume and variety of auditing tools out in the wild quickly got overwhelming. Some of these tools run in-cluster, others on a local machine. Some of them look at running workloads, others at the upstream YAML files. None of them can collate data across multiple clusters, track findings over time, or send an alert when a new issue arises. And they all have different output formats, with different ways of denoting things like severity and workload identity.
There was no way we could continue running all these audits manually, combing through them for new findings, and creating tickets by hand. We needed a way to operationalize all this data.
To make our lives - and our customers’ lives - easier, we created a plugin-based system for collating and tracking audit data. This consists of four main components:
First, we built some common infrastructure - namely a JSON store and a PostgreSQL database - that any reporting tool can plug into. The output of each individual report is saved as JSON in S3, and more structured data is extracted and stored in the database.
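As a rough sketch of how that plumbing fits together (the bucket name, table, and connection details below are made up for illustration, not our actual schema):

```go
package main

import (
	"bytes"
	"database/sql"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	_ "github.com/lib/pq"
)

// storeReport keeps the raw report as JSON in S3 and records a pointer to it
// in PostgreSQL, where structured findings can later be attached.
func storeReport(db *sql.DB, s3Client *s3.S3, cluster, reportType string, raw []byte) error {
	key := fmt.Sprintf("%s/%s/%d.json", cluster, reportType, time.Now().Unix())
	if _, err := s3Client.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("audit-reports"), // hypothetical bucket
		Key:    aws.String(key),
		Body:   bytes.NewReader(raw),
	}); err != nil {
		return err
	}
	_, err := db.Exec(
		`INSERT INTO reports (cluster, report_type, s3_key, created_at)
		 VALUES ($1, $2, $3, now())`, // hypothetical table
		cluster, reportType, key,
	)
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	db, err := sql.Open("postgres", "postgres://localhost/insights?sslmode=disable") // placeholder DSN
	if err != nil {
		panic(err)
	}
	if err := storeReport(db, s3.New(sess), "prod", "polaris", []byte(`{"checks":[]}`)); err != nil {
		panic(err)
	}
}
```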
At the heart of our database schema is a concept called an Action Item - a finding generated by any audit - which captures potential problems with security, efficiency, or reliability.
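In Go terms, you can think of an Action Item roughly like the struct below (the field names are illustrative, not the exact Insights schema):

```go
package insights

import "time"

// ActionItem is a single normalized finding from any audit. Every report,
// regardless of its native output format, ultimately boils down to a list
// of these. (Field names here are illustrative.)
type ActionItem struct {
	ID           int64
	Cluster      string
	ReportType   string // e.g. "polaris", "trivy", "kube-bench"
	ResourceKind string // e.g. "Deployment"
	ResourceName string
	Namespace    string
	Category     string  // security, efficiency, or reliability
	Severity     float64 // normalized across reports
	Title        string  // human-readable description of the finding
	Remediation  string  // advice on how to fix it
	Resolved     bool
	FirstSeen    time.Time
	LastSeen     time.Time
}
```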
On top of that, we built a thin translation layer: basically, some Go code to transform each report’s unique output into Action Items. Here we extract resource names, severity levels, human-readable descriptions, and remediation advice. Some of this is included in the underlying reports, while some we needed to supplement using our own knowledge and research.
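A translator for a single report might look something like the sketch below, building on the Action Item struct above. The input shape and the helper functions are simplified stand-ins for illustration, not the tool’s real output format:

```go
package insights

import (
	"encoding/json"
	"time"
)

// polarisFinding is a simplified, hypothetical stand-in for one tool's
// native output; each real report has its own structure and its own translator.
type polarisFinding struct {
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Check     string `json:"check"`
	Message   string `json:"message"`
	Severity  string `json:"severity"`
	Category  string `json:"category"`
}

// translate turns one report's raw output into normalized Action Items.
func translate(cluster string, raw []byte) ([]ActionItem, error) {
	var findings []polarisFinding
	if err := json.Unmarshal(raw, &findings); err != nil {
		return nil, err
	}
	now := time.Now()
	items := make([]ActionItem, 0, len(findings))
	for _, f := range findings {
		items = append(items, ActionItem{
			Cluster:      cluster,
			ReportType:   "polaris",
			ResourceKind: f.Kind,
			ResourceName: f.Name,
			Namespace:    f.Namespace,
			Category:     f.Category,
			Severity:     severityScore(f.Severity),
			Title:        f.Message,
			Remediation:  remediationFor(f.Check),
			FirstSeen:    now,
			LastSeen:     now,
		})
	}
	return items, nil
}

// severityScore maps each tool's severity labels onto a common scale.
func severityScore(s string) float64 {
	switch s {
	case "danger", "critical":
		return 0.9
	case "warning", "medium":
		return 0.6
	default:
		return 0.3
	}
}

// remediationFor returns the remediation advice we maintain per check; in
// practice this comes from a curated lookup, not a hard-coded string.
func remediationFor(check string) string {
	return "See the runbook for check " + check
}
```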
The translation layer is the hardest part of the codebase to maintain, as it needs to be updated each time a report adds a new feature or changes its output structure. But it has been immensely valuable. Rather than every organization trying to build its own abstraction on top of many different auditing tools, they can leverage Insights to automatically see all their audit data in one format.
Next, we built an agent that runs each of the open source audits - currently nine in total - and sends the results back to our server, where they’re translated into Action Items. The reports all run on a configurable schedule, typically once per hour, so we always have an up-to-date view of what’s going on inside the cluster. When something gets fixed, we pick that up as well, and can automatically close out the corresponding Action Item.
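Conceptually, the agent’s core loop is simple. Here’s a stripped-down sketch; the report list, CLI flags, and upload endpoint are placeholders rather than the real agent’s configuration:

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"os/exec"
	"time"
)

// reports is an illustrative subset of the audits the agent runs.
var reports = []string{"polaris", "trivy", "kube-bench"}

// runAndUpload executes one audit tool and ships its raw output to the
// server, which translates it into Action Items.
func runAndUpload(report string) error {
	out, err := exec.Command(report, "--format", "json").Output() // flags vary per tool
	if err != nil {
		return err
	}
	resp, err := http.Post(
		"https://insights.example.com/v0/reports/"+report, // placeholder endpoint
		"application/json",
		bytes.NewReader(out),
	)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	ticker := time.NewTicker(time.Hour) // the real schedule is configurable
	for {
		for _, report := range reports {
			if err := runAndUpload(report); err != nil {
				log.Printf("report %s failed: %v", report, err)
			}
		}
		<-ticker.C // wait for the next scheduled run
	}
}
```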
This is much simpler and more robust than the typical approach, which is to run each tool manually, on an ad-hoc basis, without reference to previous runs.
Finally, we built a user interface at insights.fairwinds.com, where users can log in and see their results. We have a single table for viewing all action items, as well as some more bespoke interfaces for digging into individual reports. This allows our users to enjoy the best of both worlds - they can see everything in a unified format, or drop into a view that is specifically tailored to what they’re currently working on.
Now that we have all the data in one place, there’s a lot of amazing stuff we can do!
The first thing we did was build some basic tracking. We wanted to be able to assign Action Items to particular people, and see when an Action Item was discovered or fixed. There were also a number of false positives (e.g. a kube-system workload that really does need root access), which we began to mark as “will not fix” or “working as intended”.
Even better, we started pushing out the normalized data to different places where we spend our time. We’ve linked it up to Datadog, so teams can track trends over time, as well as overlay Action Item events on top of core metrics like resource usage and uptime. And we’ve started sending alerts - both real-time and as a daily digest - to Slack, to make sure teams stay on top of any unexpected changes. Down the line, we’re looking into GitHub and Jira integrations to make sure the appropriate team is on the hook to remediate any issues they’ve introduced.
What I’m most excited about, though, is a CI/CD integration. Currently our agent just looks at what’s already inside the cluster - but what if we could detect issues before they even got merged into master? Only some of the reports will be able to run in a CI context, but being able to block a PR that introduces new vulnerabilities, or trace Action Items back to a specific infrastructure-as-code file, will save us a ton of time, pain, and most of all, risk.
If you’re already auditing your Kubernetes infrastructure for security, efficiency, and reliability issues, we’d love to hear what tools you’re using and where your biggest concerns are. And we’re actively looking for feedback on Fairwinds Insights - it’s available for free! Get it here.