If you're thinking about building a Kubernetes platform that your developers can use to build applications, you need to know where to start, what questions to ask, and how to build a platform your developers will actually use. Which capabilities are at the top of your list? Are you building a 'paved road' to deployment and hoping developers will use it as intended? What other factors do you need to consider? The goal of an effective internal developer platform (IDP) is to enable developers to deploy easily, but you also need to make sure your platform team doesn't become a Kubernetes help desk. How can you do that?
Recently, Fairwinds and Chick-fil-A held a panel discussion about how they are using Kubernetes and the platforms they've built to enable their development teams. You can watch the recording on demand, but this post explores their discussion of some of the obstacles they encountered on their journey to developer self-service and how they've overcome them.
Moderator: Kendall Miller, Technology Evangelist - Fairwinds
Kendall was one of the first hires at Fairwinds and has spent the past seven years making the dream of disrupting infrastructure a reality, while keeping his finger on the pulse of changing demands in the market and valuable partnership opportunities.
Panelist: Andy Suderman, CTO - Fairwinds
As CTO, Andy Suderman uses his extensive cloud native and Kubernetes experience to help drive research and development at Fairwinds. He has previously held roles as SRE, Principal Engineer and Director of R&D and Technology. He works with infrastructure spanning all three major clouds as well as verticals from Healthcare to SaaS and Fortune 500 to small business.
Panelist: Alex Crane, Enterprise Architect - Chick-fil-A
Alex Crane is an Enterprise Architect and Technologist for Chick-fil-A, Inc., a quick-service restaurant chain with roughly 2,200 locations based in Atlanta, Georgia. Alex's team identifies strategic and differentiating technology opportunities, develops platforms and capabilities, and stewards them to maturity. Some recent examples include Chick-fil-A's Cloud, API, and Internet of Things (IoT) strategy/platform. Outside of the office, Alex is an avid mountain biker and lover of all things outdoors.
Overall, the panel discussed the exciting things happening in the Kubernetes ecosystem today, with a particular focus on platform teams and building an IDP that enables developers to self-service. Alex, Andy, and Kendall discussed how they think about building the foundation, or IDP, that dev teams are using to deploy software into Kubernetes environments. We've cleaned up the transcript a bit, but left the meaning intact. For Alex Crane and Andy Suderman, we've included what each said verbatim from the Container Journal Panel Webinar - A Platform Team's Guide to Kubernetes transcript. Here are a few key takeaways:
A platform is about distilling the target (the set of tech you want to launch and deploy your applications to) into something more digestible for your developers.
Building a platform means letting your devs get back to being devs and power users of Kubernetes rather than having to administer so much stuff in their own clusters.
When it comes to building out an internal developer platform for Kubernetes, you need to think about the expectations your developers are going to have and build accordingly.
From the operations perspective, you need DNS, TLS, and ingress, so you need to create a happy path for developers to create a new app, put the app in the cluster, and get traffic to it. Then you need to put policy and guardrails on top of all of that to make it easy to do the right thing.
To control access and authorization, start with role-based access controls (RBAC) and then use policy for the things you can't quite control with RBAC.
It would be great to see a baseline deployment where Kubernetes is secure by default as an option for starting the cluster.
You need scanning and visibility into what vulnerabilities are currently running in your clusters, and you need to make sure workloads adhere to best practices (memory requests and limits, not running containers as root, and so on), automatically rejecting those that don't from even running in the cluster.
Kubernetes introduces an abstraction layer between your cloud cost and your applications; if you don’t have visibility into that, then you’re not going to be able to tackle any cost conversation whatsoever. Most of the cloud providers don’t have that visibility into the Kubernetes layer of cost at this time.
Don't look at a shift to Kubernetes as going all in with the whole stack as one giant leap, because that one giant leap will just feel overwhelming.
Kendall Miller: What is a platform? Companies are building these internal platforms. What does that mean?
Alex Crane:
Yeah, so I guess in my mind a platform is about distilling down the target, the set of tech that you want to launch your applications on, deploy your applications to, to a more digestible level for your developers. And depending on your company, that could be a wide range of the target you're trying to hit. Anything from them giving you purely the name of the container or the artifact, or maybe not even that, maybe just the repo that they want to see magically deployed somewhere, all the way to taking raw Kube or a raw stack where they get almost direct access to it with a lightweight set of guardrails around it. So it can really run the gamut, depending on what you need. But the focus of it is to both make things easier for your developers and, I think primarily, reduce the number of things that need to be decided over and over and over and over again by each team.
Kendall Miller:
When we talk about platforms and foundations for a platform, I've heard people say that Kubernetes is really a great foundation for your platform. I'm not sure that's accurate. But it is a great tool to standardize building that foundation. Andy, tell us about Kubernetes at your company. Where are you running Kubernetes, what size, what scale?
Andy Suderman:
Yeah, that's a fun question for me because we do it in a lot of different ways. We got our start running Kubernetes clusters for other companies. We build clusters in customer infrastructure, we maintain them, we manage them, we've run add-ons for them. And that's really where our team got all of its expertise in Kubernetes. Doing that over the years, across tons of different verticals. Different sizes, everything from six node clusters to I think 600-700 node clusters. We've seen it all and they all introduce different challenges across three different cloud providers. And then we use it internally as well. We have our SaaS product that runs on Kubernetes. And all of our internal tooling that runs on Kubernetes. We use it for just about everything.
Kendall Miller:
Alex, Chick-fil-A famously runs Kubernetes a lot. Tell us a little bit about that.
Alex Crane:
Yeah, so maybe amusingly, we started using Kubernetes at the edge, in our restaurants, before we started using it in the cloud. Which is probably very backwards from most people's adoption of it. Right now, the use case people typically find most interesting is the Kubernetes that we run in every restaurant in the chain. We can run containerized app workloads at the edge in a very similar way to how we can in the cloud. And then in the cloud we have piles of clusters. We started our journey with segmented clusters: each app team had a cluster for dev, a cluster for test, and a cluster for prod, across many, many app teams, which led us to many, many Kubernetes clusters. And now we're on a journey to a smaller, more centralized set of clusters, still separated by environment, but where we have more teams working in one set of clusters, or in a smaller set of clusters, so that we can get more of an economy of scale out of the tools that we're maintaining in Kubernetes, as opposed to maintaining many, many instances of things.
And that's letting our devs get back to being devs and power users of Kubernetes, rather than having to admin so much stuff in their own clusters.
Kendall Miller:
Part of what we're doing in building a platform for our teams is making it so a developer is able to push their app to production and have it work. And they want the user to be able to interact with it and have it work. What does it look like to make this very, very consumable for developers? How much do you want your developers to have to understand everything about Kubernetes?
Alex Crane:
Yeah, I think it's interesting. It's a conversation I've had with a number of developers internally, and the feedback is interesting. Some of them, as you would expect, don't want anything in their way. They want to be right there at raw Kubernetes: I know Kubernetes, don't be in my way. A good chunk of other developers don't care about Kubernetes; they don't even particularly care about Docker. They have business features to ship and they want to compile, build, and move on. Or at the very least, their interest isn't in Docker, it's not in the orchestration layer, it's in their app. They think their app is cool and want to make it cooler, but how it gets run, they're ambivalent about.
At the moment, the balance that I've been looking to strike is around making both sets happy. Which is possible, mostly possible I'll say, by providing simple abstractions so that those who have a minimal set of info, a host name they want something hosted on and a container they want to have run, can bring that to the table; or they can have wider access to the manifests, within guardrails of policy, and do stuff more raw, more low level, if they prefer. But it's choose your own adventure for the dev.
Kendall Miller:
To compare Kubernetes to a sink, there are people who want to have a sink and know that their water turns on and turns off and drains. And there are people who are going to look at the sink and want to understand all the piping behind it and where it goes and what they can put down the sink and what they can't, and why, and where it's going to get clogged.
Alex Crane:
And to extend your metaphor, I think it comes down to: when it inevitably gets clogged, is that person the type that wants to fix it themselves? Do they want to open that trap at the bottom of the drain and clear it out, or do they want to call a plumber and have the plumber help them? And that comes down to the individual.
Andy Suderman:
And that's where I start to wonder about the philosophy, this spectrum between not wanting to know anything about how my app gets deployed, and wanting to have full control over it. When you skew too far towards the complete abstraction layer, where the developers know literally nothing and don't care about how their app gets to production, then when it does break, they have no choice but to call the plumber. And that plumber is my overworked SRE team that may or may not have time to do that. In smaller companies this probably happens more often. I wonder about the value of creating the full abstraction layer without any sort of education or exposure to at least some level of troubleshooting when things break.
Kendall Miller:
Yeah, so there's a give and take here and there's going to be a spectrum of users on the other end, even when those users are the developers.
Alex Crane:
I was just going to extend on Andy's point there. Something that I've been continuing to mull over for the last year, when it comes to building out a Kube platform, is: what are the expectations your developers are going to have? Or when you hire somebody or bring in a company to work with your platform, with your Kube, with an app deployed in it, what are their expectations in terms of how they've interacted with it before? A lot of people, if you go take an AWS or a Google or an Azure training course in Kubernetes, or one from Fairwinds, are going to come away using kubectl, editing a Kube YAML file, knowing what a Kube service is, et cetera. If you're running Kubernetes at your company, but you've really created a complete abstraction, then nothing anybody is interacting with looks like that: they're not using kubectl, and every tool that they took in their training and certification class has nothing to do with their use of the platform.
Then you end up in a very weird spot. Because you're saying you have Kubernetes and you're hiring for Kubernetes, but in practice they come in and they're learning something totally different.
Kendall Miller:
What are the tools that you put in place to make this make sense to your dev? I've got compute, a cloud, my own infrastructure, and then a layer on top of that, maybe something in between that and Kubernetes. Between Kubernetes and the developer wanting to push something through to production, what are all the layers I'm going to put in between so that things are easy and they don't mess things up?
Alex Crane:
Sure. I mean, to start at the beginning, I think it starts with a templating layer, whether that's Helm or Kustomize or another one, where you're able to define in that templating layer what your standard, as it were, or set of standard types of apps are. Whether those are event driven, standard REST apps, serverless in Kube, whatever that is, so that a developer is able to bring that minimum set of information, depending on how that templating engine works, provide it to that template, and render out that set of Kube YAMLs or the contents that would need to be applied to the cluster.
From there, I think it's about how does that get applied to the cluster? And in many cases, particularly in trainings that you take and certifications, they skip what I'll call the enterprise or best practice layer. They all say in best practices, "Use infrastructure management to apply this stuff, GitOps, et cetera." But in practice, they have everyone do stuff with kubectl. I think that part can be fairly jarring when you switch over to start using GitOps, for instance with Argo CD or Flux or GitHub Actions, to push and apply stuff to clusters. Because now you've disconnected your devs one link from the cluster, at that point, in terms of how their stuff gets applied. But yeah, so a templating layer and then GitOps or similar for applying those resources to the cluster. And then, depending on which of those choices have been made, policy enforcement, both for security reasons, but also for best practices or just company standards adherence. Like naming policy for apps, have you set limits and requests? Everything ranging from best practices to security to company policy for how you want your apps deployed.
And those can either be integrated up front, where, with Helm for instance, you can do those validations as part of the chart and reject even generating it, all the way to the back end in the cluster. And even in the middle, where you have pull request hooks in Git that can reject your developers from submitting something that hasn't passed muster.
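To make the templating-plus-GitOps flow Alex describes concrete, here is a minimal sketch of an Argo CD Application that takes a small, developer-supplied set of Helm values and lets the chart and the GitOps controller do the rest. The repository URL, chart path, hostname, and app name are illustrative placeholders, not anyone's actual setup; Flux or another GitOps tool would look different but follows the same idea.

```yaml
# Hypothetical example: an Argo CD Application deploying a team's app from a
# Helm chart in Git. Repo URLs, paths, and values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/app-charts.git
    targetRevision: main
    path: charts/standard-web-app
    helm:
      values: |
        # The minimal info a developer supplies; the chart renders the
        # Deployment, Service, and Ingress behind the scenes.
        image: registry.example.com/orders-api:1.4.2
        hostname: orders.example.com
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true      # remove resources that are deleted from Git
      selfHeal: true   # revert manual drift back to what's in Git
```

The developer only edits the small values block in Git; how it gets rendered and applied is owned by the platform, which is exactly the "one link removed from the cluster" disconnect Alex mentions.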
Andy Suderman:
Whenever I get asked that question, I go at it from the opposite side. You start from the developer and work forward. I'm an ops engineer, I come from an ops background, I was a sysadmin before. I think, "Okay, we have code, we have a container, but that container can't just run. We have to have a way to get traffic to that container most likely. We have to create a DNS name for that. We probably need TLS somewhere." And so the first thing that comes to mind is always what I call the trifecta, which is DNS, TLS, and ingress. Some form of ingress controller, and a path forward for developers to utilize that ingress controller. Whether it's writing an ingress object or, if you're using something like Gloo, adding a route, and having the happy path for them to add new endpoints to that as they deploy new apps.
And then cert-manager for certs, external-dns for DNS control, or however you want to structure your DNS. There's 1,000 different ways to do it, of course. But just some way that people can create a new app, put the app in the cluster, and get traffic to it. Because, as Fairwinds is spinning up clusters for new customers, the first thing they're going to ask is, "Well, okay, how do I get traffic to my container that we're deploying? How do we get TLS? How do we do all this stuff?" Those are the first things that come to mind for me. And then policy and guardrails on top of all of that, for sure. Absolutely necessary.
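As a rough illustration of that trifecta for a single app, the sketch below shows an Ingress that the ingress controller serves, that cert-manager issues a certificate for, and that external-dns can create the DNS record from. The hostname, cluster issuer, and ingress class are placeholder assumptions.

```yaml
# A minimal sketch of the DNS/TLS/ingress "trifecta" for one app; the
# hostname, issuer name, and ingress class are illustrative placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders-api
  namespace: orders
  annotations:
    # cert-manager watches this annotation and provisions the TLS certificate.
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - orders.example.com
      secretName: orders-api-tls   # cert-manager stores the issued certificate here
  rules:
    - host: orders.example.com     # external-dns creates the DNS record from this host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: orders-api
                port:
                  number: 80
```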
Kendall Miller:
Do you ever let your developers loose? How much freedom do they have? How many of your developers have root access to all the things?
Alex Crane:
Yeah. In the previous generation of clusters, the many-clusters scenario, most of the people on those dev teams had that kind of access, just like they had similar levels of access to their AWS account. That model and that set of teams really spun up out of that shift right that happened a number of years ago now, where it was about the DevOps model. Where, hey, teams should be self-sufficient and own all this stuff themselves; that way no one else gets in their way. And that's been good and it's been fine. It's had some advantages: when stuff's broken, they can fix it quickly, if they can fix it. And the blast radius there is also a lot smaller. It's their team's stuff, they're responsible for it. If they mangle it, it's on them. In the new model, with a bit more multi-tenancy, internal multi-tenancy, but still multi-tenant between teams, they get very little.
They get read, they can see stuff, they can do some triage-type operations, kick stuff loose. Sometimes you need to take a wrench and bash on a pod or a service and let either Kube restore it or your GitOps tool restore it. They can break stuff loose, but they don't have those mutating changes, so they can't edit DNS entries that might be there, edit secrets directly, or edit deployments directly, as that would cause drift from what's in Git. And that accumulated state drift over time leads to some big challenges.
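A rough sketch of that kind of namespace-scoped access, with purely illustrative group and namespace names: read access to everything in the namespace, plus the ability to delete pods so Kube or the GitOps tool recreates them, but no create, update, or patch.

```yaml
# Illustrative Role and RoleBinding: devs can see everything and "kick stuff
# loose" by deleting pods, but cannot make mutating edits that drift from Git.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dev-triage
  namespace: orders
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]   # read-only visibility
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["delete"]                 # restart pods and let Kube/GitOps restore them
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-triage
  namespace: orders
subjects:
  - kind: Group
    name: team-orders                 # group mapped from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dev-triage
  apiGroup: rbac.authorization.k8s.io
```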
Kendall Miller:
Alex, how do you control all that? With RBAC? Do you control that with something else?
Alex Crane:
Yeah, it starts with RBAC and then pinned behind RBAC is policy for those things that we can't quite control in the same way with RBAC.
Kendall Miller:
What do you use for policy? Every organization, particularly a sufficiently large one like yours, is going to have internal requirements, external requirements, all kinds of requirements for compliance to keep an auditor happy, to keep your CFO happy, to keep your CISO happy. How do you do it?
Alex Crane:
Right now it's been primarily Kyverno, which is an open source policy manager, a little bit of OPA, and then also a good chunk of the Fairwinds policy out of the box. I think it's in an interesting spot, because to me, Kube is in a difficult spot right now. It's wonderful and it's great and it should be adopted, to make that clear before I throw a little bit of shade at Kube. But it's also in that space, if any of those listening were back in the earlier days of Linux: you would set up Linux, but then you would spend a couple hours setting up all this stuff, whether it was the logging, so the logging was proper, but also security. Go lock this down, go apply this policy, et cetera. And Kube's still in that state right now where you set up Kube...
Kendall Miller:
I remember weeks just to get networking working.
Alex Crane:
Well, depending on the time. But yeah, I mean, you had a lot to set up. And in particular with security, it's like, okay, you set up Linux, now install Security-Enhanced Linux and make all these adjustments to your operating system. And that's kind of the phase Kube is in right now, even with the off-the-shelf ones from the major cloud providers and others. You set it up, then you set up the Fairwinds policy and best practice stuff, or you go grab the Kyverno chunk of recommended policies, both best practices and also what lets you meet those CIS security standards. And then you layer your stuff on it. The thing I'd love to see from Kube itself, whether that's from my vendor, the one who's maintaining or producing Kube (EKS from Amazon, or similar for GCP or Azure), is that baseline as an option for starting the cluster. Kube being secure by default, if you will. Then, to your point, those things that are our company's unique stances and standards and enhancements on top of those, enforcing those with a policy engine like Fairwinds or Kyverno or OPA.
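For a sense of what those off-the-shelf policy bundles look like, here is a hedged sketch in the style of Kyverno's community best-practices policies: require resource requests and a memory limit before a Pod is admitted. The policy name, message, and exact fields are illustrative, not the specific policies any one team runs.

```yaml
# Illustrative Kyverno ClusterPolicy: reject Pods whose containers do not set
# CPU/memory requests and a memory limit.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce   # block non-compliant workloads rather than just auditing
  rules:
    - name: check-container-resources
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```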
Andy Suderman:
I think that's what GKE Autopilot is trying to do. It's got a lot of restrictive policies in place. You can't deploy anything to an Autopilot cluster without resource requests and limits, just that base level of stuff. What I seem to be hearing from the people that we talked to about it is that it's too restrictive, and they can't modify that enough. And so how do you strike that balance? Where do we as a Kubernetes community strike the balance between those two?
Alex Crane:
I think that's a really interesting question. I'm not terribly familiar at the moment with where Autopilot is. I've seen some stuff about it, but I don't know enough in the weeds to speak about it. But I would say in general, I would be curious about pushing back on some of that. A number of times when I hear that some of that stuff's too restrictive, they're like, "It's too restrictive. It won't allow me to run containers as root. It's too restrictive, it's making me put memory requests and limits in." And it's like, ooh. In a total sandbox environment, great, don't worry about those maybe. But even in test, let alone prod, and even in dev, I can't tell you the amount of pain I have seen by not checking those boxes off upfront and saying we'll fix that later. It causes production outages, it causes pain porting forward. Honestly, the one bone I really have to pick with Kube is the fact that they feel like they can't switch it.
So that run as root is not the default posture in Kube, and in the Docker world as a whole. Because there's a lot of containers that are built by great community projects out there that you can't even run out of the box, because they expect to be run as root. And now you need to go switch the user that the files are owned by internal to the container, set those outside. But anyway, all of that is a bunch of in-the-weeds reasons why there's a lot of that stuff that really needs to be there from day one. Because if you try to clean up, let's say, the run as root, or not using elevated perms, or memory requests and limits a year into the game, it is so much more pain than just getting those right in the beginning. But I am very open to that. There may be some things that actually are a little too restrictive that might need to be opened up in some of that.
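Baking the non-root posture into the workload from day one is mostly a few securityContext fields, as in this minimal sketch; the names and user IDs are placeholders, and the image itself also has to be built so its files are owned by that user.

```yaml
# Illustrative Deployment: run as a non-root user from the start rather than
# retrofitting it a year in.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      securityContext:
        runAsNonRoot: true        # kubelet refuses to start the container as UID 0
        runAsUser: 10001
        fsGroup: 10001            # mounted volumes stay writable without root
      containers:
        - name: app
          image: registry.example.com/orders-api:1.4.2
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
```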
Andy Suderman:
There's definitely an exception layer that needs to be included at some point and carefully managed, but most of those things can be avoided.
Alex Crane:
I'd much rather be opting in to a more insecure posture than having to opt in to the secure one.
Kendall Miller:
One of the requirements that you have internally to keep things working and working smoothly is cost. How do you manage Kubernetes cost? How do you stay on top of that? Do you manage cost in Kubernetes in any specific way?
Alex Crane:
Yeah, so there's probably a few layers to that onion. One is being able to see what cost even means. At the very outside edge, you can look at your run cost in your cloud provider for your control plane and the instances it's using. You have that, but a lot of times you want to know how much of what projects is taking up what of the cluster, what of the bandwidth, et cetera. And the main project I'm familiar with for that is Kubecost. There's probably some others out there that also help give cost inspection opportunities; Kubecost is one of the big open source ones. But I think, past that, it ends up being interesting. Because depending on the size of a company and the size of the teams involved, infrastructure cost, so the cost of running EC2 instances in a Kube cluster, can frequently be much, much lower than the cost of the headcount, the people working on that project, or the licenses and support that you have for tools that you're running in that cluster and in that environment.
I think cost optimizing is important, and being responsible stewards of those resources that you're running. On the flip side, I do also frequently encourage people not to necessarily try to over-optimize for the infrastructure cost. I can't tell you how many times I've seen people spend weeks on end trying to optimize down so their app would be running, let's say, twice as efficiently in Kubernetes. And so they've literally saved tens, maybe hundreds of dollars a month for the company, but they've spent two sprints of a developer's time, so it'll take a decade to recoup the investment.
Andy Suderman:
Well, as a Fairwinds customer, I'm surprised you didn't mention that we do cost as well. We have tools for visualizing costs in your cluster. I love that you said be cognizant of how much effort you're spending on trying to reduce cost, because infrastructure cost is generally a smaller percentage of costs across a company than other costs. But when you are worried about it, setting those resource requests and limits first is the number one thing. We have so many people that come to us and they're like, "We need to right size our nodes." And I say, "Okay, but you're not setting resource requests and limits. Your node count really doesn't mean anything to me until you fix the resource requests and limits. I don't know what's using what. And the cluster can't allocate resources appropriately." That's number one.
And then visibility. You're introducing an abstraction layer between your cloud cost and your applications, and if you don't have visibility into that, then you're not going to be able to tackle any cost conversation whatsoever. Whatever tool you use, just use a tool; most of the cloud providers don't have that visibility into the Kubernetes layer of cost at this time.
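For reference, setting requests and limits is only a few lines per container. The numbers below are placeholders; in practice they should come from observed usage (Goldilocks, mentioned at the end of the panel, is one way to get recommendations).

```yaml
# Illustrative Pod with requests and limits set; the values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: orders-api
spec:
  containers:
    - name: app
      image: registry.example.com/orders-api:1.4.2
      resources:
        requests:
          cpu: 250m        # what the scheduler reserves on a node
          memory: 256Mi
        limits:
          cpu: "1"         # CPU is throttled above this
          memory: 512Mi    # the container is OOM-killed above this
```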
Kendall Miller:
How do you audit your Kubernetes setup? How do you audit the access people have? How do you audit the workloads that are running in the cluster and make sure you're not running nefarious things?
Is that automated? Are you doing it manually? When Log4j comes out with a massive vulnerability, how do you find out where that's running and who you ask?
Alex Crane:
Yeah, a myriad of approaches. In the way you approach anything responsibly today, you throw everything you can at it from every single direction until it's finally pinned down. That's what it feels like wrapping your hands around security today. But we use a range of things. Fairwinds has some tooling in the Fairwinds Insights product that helps us do some scanning and gives visibility into what vulnerabilities we have currently running in the cluster, as well as which things are adhering to best practices, like memory requests and limits, stuff running as root, et cetera, and it allows us to reject those from even running in the cluster. We do similar with Kyverno as well, for some different things, because there's a real good community out there that's producing some things. We get to cheat and use what other people have submitted as really good policies to just go ahead and sling into the cluster to do things that we want, in addition to the ones that Fairwinds works with us on.
And then past that, we use a few other tools, even, let's say, out of band of Kubernetes itself. We use some container scanning tools around our containers, and we also use a product called Wiz, which has been fantastic for giving us visibility into our security posture at the cloud level, that things just in our cloud accounts themselves are passing muster, but it also understands Kube. We have everything from the load balancers, IAM policies, Kube, and the severity of those vulnerabilities, et cetera, and we can triage those as they come up.
Andy Suderman:
One of the things you can do too that some of our customers do is get the list of vulnerabilities from Insights. That's great. But sometimes, for some customers, it's a massive list. And then it's like, "Okay, how do I tackle this?" When you're building a container in CI/CD, we can say, "Hey, these are the known vulnerabilities in just the container you're building right here." And so then they can go try to mitigate that. Or having policies to not allow certain levels of vulnerability to be deployed at the time. Of course, there are some that pop up post-deployment. And then alerting mechanisms or creating Jira tickets out of that, feeding those back to the developers.
But shifting it left can be a nice way to distribute the load of that, instead of having one central team that's like, "Hey, we have 10,000 vulnerabilities, we need to fix these." That's just a mountain of work that no one person can tackle. Base images that you're using to build your containers from are often a source of your vulnerabilities and often an easy thing to fix. If you're based on Alpine and there's a vulnerability that's released, there's probably a new version of Alpine that has mitigated that already. And so we're going to be adding base image detection to some of our capabilities to say, "We think you're using this base image, go update that. You'll fix these vulnerabilities."
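One common way to shift that check left is to fail the build when the image contains known high-severity CVEs. The sketch below is a hypothetical GitHub Actions job that uses Trivy purely as an example scanner (the panel itself discusses Fairwinds Insights and Wiz); image names are placeholders and the Trivy CLI is assumed to be installed on the runner.

```yaml
# Hypothetical CI job: build the image, then fail the pipeline if the scanner
# finds high or critical CVEs, so developers see them before deployment.
name: build-and-scan
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t registry.example.com/orders-api:${{ github.sha }} .
      - name: Scan image for known vulnerabilities
        # Assumes the trivy CLI is available on the runner; swap in your scanner of choice.
        run: |
          trivy image --exit-code 1 --severity HIGH,CRITICAL \
            registry.example.com/orders-api:${{ github.sha }}
```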
Kendall Miller:
Fairwinds Insights is one of the tools. There is other tooling out there that does something similar, that can stop things from being deployed that have known vulnerabilities or known security issues, or are just configured to run as root. You can stop those things from being deployed into a cluster with policy. On the one hand, when you work on a platform team, you feel this tension of, "Well, I don't want to keep developers from being able to do their job, but I also want to put appropriate guardrails around it." But I also think if you put yourself in the developer's shoes, a lot of times the developer is saying, "I don't know how to configure this. I don't know what has known vulnerabilities, please tell me. I don't want to deploy something into the cluster that's broken, that has a known vulnerability; I'm just trying to do my job. I'm just trying to deploy, and now I'm going to put a known CVE into the cluster. Why didn't you stop me? How come you didn't build guardrails for me?"
Alex, you mentioned RBAC. How do you handle RBAC? Do devs have access to things? Do devs have access to only their team's namespace, or do they have their own clusters?
Alex Crane:
Yeah, in our previous model, teams had their own clusters, and in the new model they have access only to their namespaces. We use labels on the namespaces to map their identities to groups, and when they connect, that determines which namespaces they have access to.
Kendall Miller:
Somebody asked for your thoughts on Red Hat's portfolio (OpenShift, Advanced Cluster Manager, Advanced Cluster Security) versus a hyperscaler option like EKS, AKS, et cetera?
Alex Crane:
No, I was just going to say, I've heard wonderful things about Red Hat's portfolio from people that are using it. And I've heard wonderful things from those on all flavors of the hyperscaler options as defined there, as well as some smaller players in the market. And I think that's one of the exciting things about Kube: no matter which one of those you're on, sometimes you'll be building in-house stuff and then sometimes you're going to buy something a little more off the shelf from a vendor that you need to deploy into your environment. And because they're all largely adhering to the Kube contracts, it's like installing something on Ubuntu versus SUSE versus Red Hat, in which there may be a little bit of work to do, but it's still a solid target to hit, no matter what. I would say with any of those, they're all really good options.
Kendall Miller:
Another question was: Alex, are you using Fairwinds Insights and their managed cluster solution? Why did you choose Fairwinds over other solutions?
Alex Crane:
Yeah, for a truthful answer, no. Going back three or four years, I was very interested. I think Fairwinds was very early to the game with their Insights product, and with some of what they were doing, being able to see both security issues but really also tech debt and best practices issues with their Insights product. Kyverno didn't exist. OPA probably existed as a project, but I wasn't really seeing it around too much, and a lot of people weren't leveraging it that I was seeing. I could be wrong on that; that's usually when I get the message telling me that OPA was around before Kubernetes and has been heavily used forever. I found that fairly compelling, as opposed to some of the other options that were maybe focused very heavily on the Kube-itself front, but not on where we were going to be two years later with Kube, which I felt Fairwinds had a really good handle on when we started to engage with them.
Kendall Miller:
Another question: why isn't there out of the box security in Kubernetes?
Alex Crane:
Yeah, okay. I am not on the Kubernetes security SIG or one of the core maintainers, so I may be wildly off base here, but this is from talking to some people who were, over time, and others who were in the know. I think part of the challenge there is there's a big focus on not having a big drift in the core API that makes up Kubernetes, and in order for them to address certain things, they have to break the contract that a lot of apps need. In some cases they've been able to do that, with the switch to RBAC as the default option and a few other things they've deprecated recently. But there's a few others that, for reasons I'm not in the weeds on, it's harder to get over the hump on: the run as root, or stopping the run-as-root stuff, or redoing the way secrets work to something that's sane and not absolutely insane.
But yeah, I think the problem is some of those are just in the weeds. And then, honestly, the other piece of that is a philosophical difference on what Kubernetes is. A good chunk of people, particularly from my conversations with some people who've worked on core Kubernetes and worked at Google and some others, see Kube as, I'm trying to remember the metaphor Kendall used earlier, but it's the building blocks. It's the Legos that you're going to put a platform on top of, so that your users won't even know they're in Kubernetes. It's the basis for you building your own complete platform with its own complete interface. And I think that is a fair way to look at it. I also think that what I, as a customer of Kubernetes, and a lot of other users are looking for is something where, well no, why can't Kube be the platform?
If you make these three little twists, it can be either, right? You can keep people close to it or you could completely abstract it if you want. But I think that philosophical piece there is one of the reasons why some of that hasn't gotten across. It's a similar reason to why generics didn't make it into Go for so long. Because I'll tell you, let's say in defense of that, even though I'm on the other side of the fence: it's very easy as somebody who uses a thing to demand new features and say you want new features and you want change. And I'll tell you, there's a lot of projects I've followed over time, which for the better part of valor I probably won't name, that have gotten lost along the way.
They got all the features in there that everyone said would be cool and awesome and change the world. And three or four years in, you have a project that's a nightmare for new people to come in to use and maintain; it's got security problems, it's hard to understand and digest. By keeping things tight and close, everyone at the end of the day is actually able to move faster and better.
Andy Suderman:
It's a massive problem. Yeah, I mean, it's like you were saying, it's a balancing act by the maintainers between flexibility, reliability, and security. And really, at any given time, pick two, maybe, right? You're always balancing between lots of different factors, and saying from one perspective that security is lacking is an underdeveloped view of what is a massive code base and a very complex project that has its own governing system around it.
Kendall Miller:
The thing about Kubernetes is you can do anything in it. If you want out-of-the-box, sane defaults, check out Polaris; that's one of our Fairwinds open source projects. It exists because we've seen lots of people stand up Kubernetes and make the same mistakes over and over again.
If you turn on the out-of-the-box defaults, it enforces a whole bunch of things that just make sense. We push that into Insights, our SaaS product, with a whole bunch of other sane defaults. If you have a lot of clusters, it's actually really easy to get reasonably right. Now, you're probably going to have some compliance needs that are specific to you and your organization and your vertical.
Another question: sometimes we have a problem where we compare our infrastructure to name brands and find out that our costs are multiples higher. With that said, do you think this is due to not doing things right, or are we not developing with scaling in mind? What are three top priorities to design and develop for scale in Kubernetes for a platform?
Alex Crane:
Yeah, I think in my mind... the devil's in the details on that frequently. It can be a little hard to say without looking in the weeds, but just having seen some circumstances like that in the past, I would say one piece is density. How much, I'm going to say work, how much business work are you getting out of your containers and the nodes, the instances, those are running on? That hits at two tiers. One is the programming languages involved. To avoid throwing further shade at projects, I'll keep these generic. But some programming languages take a lot of memory and a lot of CPU to handle a certain amount of requests, while some other languages, or more modern frameworks in those languages, require substantially less, sometimes a hundredth the amount of CPU and memory to handle a certain volume of requests.
I've definitely seen where maybe some older code has been ported forward and is handling many more requests than it ever did in the past, and it's just a lot of memory, a lot of CPU serving very little at the end of the day. Some optimizations can be had there. Another one is density; particularly in our many-cluster scenario, we have piles of clusters right now that are running as three-node clusters across multiple AZs and are running two or three apps. They're not dense at all. And that lack of density, if you have this big sprawl of lots of clusters that are only fractionally used, means you can end up coming out much higher on the top end. But there's a number of tools; Fairwinds has some good insight into your cluster, what your run costs are, and what's being used on the nodes, that can help.
There's a few other tools to that effect as well to give you that visibility into what your density looks like. And there's some other really cool projects coming out on that front as well: Karpenter for EKS, a couple open source projects I wish I could remember offhand, and I know GCP has one as well. To keep it short, they help you manage the packing of containers onto the hosts that are running, which can really help you bring that density level up and your total costs down.
Kendall Miller:
Any nuggets of wisdom for all the Kubernetes people that are tuned in?
Alex Crane:
At the end of the day, remember... make sure you're comparing apples to apples on stuff. I hear a lot of times people say Kubernetes is complicated, right? Spinning up a Kubernetes cluster and then using kubectl against it is about as easy as Docker Compose, is about as easy as a Lambda. It's once you start adding everything that ties your company's stuff together into a full CI/CD platform, the GitOps and tooling and authentication and all that stuff, that it starts to get complicated. But you'll see that in those other spaces too, if you do those things in those ways. That said, if you're at a smaller company or a smaller team, and you've been doing things manually in your cloud provider's console and you've been happy with that, or you've been fine with a Lambda, there are huge risks involved with that.
But don't look at a shift to Kubernetes, going all in with the whole stack and the right way of doing things, as one giant leap, because that one giant leap will just feel overwhelming. Take steps: look at where you are now, look at doing an analog for that in Kubernetes, and then take steps along that journey to improve your setup. I think you'll find a lot more success and joy with it.
Andy Suderman:
I'm going to end with the thing that I always end with, because I still see it, despite the fact that we're seven years in. Set your resource requests and limits, please. That's it.
Kendall Miller:
Use Goldilocks; we have an open source project for that. If you have lots of clusters across lots of places, use Insights.