When you've got a medium- to large-sized company, you need a platform to help your application teams ship code into production in a standardized way. That standardization makes your applications easier to maintain and helps keep them scalable, secure, and cost-efficient. A platform can provide a standardized development environment, automated deployment and scaling, centralized monitoring and logging, and cost optimization. In other words, a platform helps you improve the quality and efficiency of your application development and deployment process.
Kubernetes creates a common language for dealing with cloud infrastructure. It provides a single interface for provisioning machines to run your workloads, provisioning disks, setting up ingress, issuing certificates for your applications, scaling your apps and services, and much more. That means that Kubernetes can really be the backbone of your infrastructure. But Kubernetes is kind of like AWS; you wouldn't want to just give your developers keys to the Kubernetes cluster and say, “do whatever you want.” You really need to layer some more abstraction on top of K8s to make sure that your dev teams are doing things in a secure, efficient, reliable way.
When you spin up a new cluster in your cloud environment (e.g. an EKS cluster on AWS, or a GKE cluster on GCP), you just have a vanilla Kubernetes cluster with nothing installed in it. It has a lot of functionality, but it doesn't have everything you need for a production application. You'll need some solutions for:
To cover those needs, you need to install add-ons that handle the peripheral tasks core Kubernetes doesn't take care of.
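As a concrete (and purely illustrative) example, a platform team might install a couple of common add-ons with Helm; the specific charts here (ingress-nginx for ingress, cert-manager for certificates) are examples, not a prescribed list:

```sh
# Illustrative add-on installs via Helm; substitute the charts your platform actually needs.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Ingress controller
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Certificate management
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set installCRDs=true
```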
You also need a way for your development teams to get their applications into the cluster and running. In the early days, you might have had your engineers run kubectl to push some resources into the cluster, but what you really want is a CI/CD process that kicks off a whole deployment process every time your developers push code or tag a particular commit. You may be using Helm under the hood to get those resources into the cluster; it's up to your platform team to figure out what's going to work for your organization.
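As a rough sketch, that kind of pipeline might look like the following, assuming GitHub Actions and a Helm chart checked into the application repo; the chart path, secret name, and trigger are placeholders to adapt to your setup:

```yaml
# Hypothetical deploy workflow: every push to main runs a Helm upgrade.
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure cluster access
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG_DATA }}  # placeholder secret
        run: echo "$KUBECONFIG_DATA" | base64 -d > kubeconfig

      - name: Deploy with Helm
        env:
          KUBECONFIG: kubeconfig
        run: |
          helm upgrade --install my-app ./chart \
            --namespace my-app --create-namespace \
            --set image.tag="${GITHUB_SHA}"
```

The important part is the trigger: deployments happen because code was pushed, not because someone ran kubectl by hand.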
Once you've got cluster add-ons and deployment in place, you've got something running… but eventually your developers are going to hit an issue where things aren't deploying, apps aren't scaling well, or an application crashes. You're going to need to provide routes for them to:
Whether that's devs opening up a pull request, raising an issue in Slack, or opening up a Jira ticket, figuring out how you're going to handle those interactions between your platform team and your development teams is super important.
What's the difference between governance and feedback? They may sound the same, but there is a difference. Governance is all about making sure that your development teams don't do anything they're not supposed to do, and about preventing problems from recurring. Feedback is all about how your development teams surface problems to your platform team.
The best way to handle governance is by putting policy in place for the development teams, such as:
Every time you see your development teams create a problem in production, it can be helpful to put a policy in place, either in warning mode or enforcement mode, to make sure those same problems don't come up again in the future.
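For instance, a policy engine like Kyverno (one option among several; Gatekeeper or in-house tooling works too) lets you express a rule in audit mode first and flip it to enforce once teams have had time to adapt. This sketch flags the use of the ":latest" image tag:

```yaml
# Hypothetical Kyverno policy: warn about ":latest" image tags now,
# flip validationFailureAction to Enforce once teams have cleaned up.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit   # warning mode; Enforce would block the resource
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pin images to a specific tag instead of ':latest'."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```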
If you're trying to move towards a self-service model, the first thing to understand is what's going on now. Here are a few important questions you should get the answers to:
The answers to these questions will help you figure out what sorts of policies you might want to put in place. For example, suppose everybody needs to have auto-scaling set up: how many teams have auto-scaling set up today, and how many will be affected if you put a policy in place enforcing that? To figure this out, deploy auditing tools to understand what versions of everything you are running, how many applications there are, and how many people are deploying via CI/CD versus manually. Talk to your development teams, do some automated auditing of your environments, and check out your logs. This will give you a feeling for how things are working today and what's not working.
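Even a few rough one-liners can get this audit started, assuming you have kubectl access and jq on hand (dedicated tools like Polaris or kube-bench can take it much further):

```sh
# Which Kubernetes version are the clusters running?
kubectl version

# How many workloads exist, and how many have autoscaling configured?
kubectl get deployments --all-namespaces --no-headers | wc -l
kubectl get hpa --all-namespaces --no-headers | wc -l

# Which Deployments have containers with no resource requests at all?
kubectl get deployments --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.spec.template.spec.containers[]; .resources.requests == null))
      | "\(.metadata.namespace)/\(.metadata.name)"'
```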
When you assess your situation, you're going to find issues like people using, or asking for, way more resources than they need. They're not setting up auto-scaling. They've got security vulnerabilities in their Docker containers, they're missing best practices, they're not setting health probes, and so on. You could go into Slack and message every developer to tell them not to do things this way, but that approach is neither helpful nor scalable. The best thing to do is shift the findings left (think of the developers as being on the left side of the deployment process and your SREs all the way on the right).
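To make the target concrete, this is roughly what you're asking developers to converge on: a workload with sensible requests and limits and health probes wired up. The names, paths, and values below are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2   # pinned tag, not ":latest"
          resources:
            requests:
              cpu: 100m        # ask for what the app actually needs
              memory: 256Mi
            limits:
              memory: 512Mi    # cap memory so one pod can't starve the node
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
```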
The best way to do this is to start integrating auditing tools into your CI/CD pipelines. Start scanning for any issues with best practices or any policy violations. You should be scanning:
At the very beginning, start by warning developers of the issues. For example, adding comments on every GitHub PR that say: “You added a new workload here, and it doesn't have any liveness probes set up.” Or a message saying, “You created a new Docker image here; it has 10 vulnerabilities in it and two of them are critical.” That helps the developers understand what they are doing wrong. Eventually, you'll want to block issues, too. If somebody opens a pull request that's creating new issues or adding new vulnerabilities, they should not be allowed to merge that unless they get approval from somebody or they fix the issue.
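As one illustration, a CI step like the following (assuming GitHub Actions and Trivy; any image or manifest scanner can play this role) starts out as a warning and becomes a blocker when you change the exit code:

```yaml
# Hypothetical image-scanning step in a pull-request workflow.
- name: Scan image for vulnerabilities
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/my-app:${{ github.sha }}
    severity: CRITICAL,HIGH
    exit-code: '0'   # '0' = report only; set to '1' to fail the check and block the merge
```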
It can be a journey figuring out how much you want to scan, how much feedback you want to surface, and what you want to block on. Use the findings from that first step of auditing your environment to decide what you are going to scan first in CI/CD, what you are going to block first, and what is the most important thing to start enforcing as early in the pipeline as possible.
It's easy to circumvent CI/CD. Development teams can force merge PRs or use kubectl and Helm to edit directly in the cluster without going through the CI/CD process. That’s why it's important to start enforcing policies at a lower level to ensure your development teams aren't circumventing those best practices at the CI/CD layer. A Kubernetes admission controller is a pretty hard wall blocking people from putting things into your cluster that are going to cause problems.
You can set up an admission controller in your Kubernetes cluster that says: if you see a resource that doesn't have memory requests or CPU requests, reject it, don't allow that into the cluster. Then you have a pretty strong guarantee that you're not going to see that problem in production ever, even if a developer forces a merge or the CI/CD doesn't catch it. Guardrails are a great mechanism to make sure you have certain guarantees about what does and does not get into your cluster.
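Here's a sketch of that rule, again assuming Kyverno as the admission controller; a Gatekeeper constraint or a custom validating webhook would express the same idea:

```yaml
# Hypothetical policy: reject any Pod whose containers lack CPU or memory requests.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-requests
spec:
  validationFailureAction: Enforce   # hard wall: non-compliant Pods are rejected
  rules:
    - name: check-container-requests
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Every container must declare CPU and memory requests."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
```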
Sometimes platform engineering teams are worried that guardrails will slow down the development teams by putting roadblocks in their way. These guardrails actually help devs move faster because they know they're not going to catastrophically break anything unintentionally. It makes devs more confident to ship quickly and improves self-service.
Once you have guardrails in place, you have a good baseline of an environment. But you still need to have some kind of auditing in place for your environment so your SREs can look at it and determine how healthy different applications are. Evaluate how well they are scaling and whether any are redlining in terms of memory, CPU, or ability to scale horizontally.
You need to be able to surface feedback that shows whether you are wasting $900 a month on an application that really should only be spending $100 a month. You need to do some monitoring and send feedback to the developers. It can also be helpful to allow teams to compare the security, reliability, and cost-efficiency of their workloads so they know where to focus the efforts of your platform team.
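Even a quick comparison of what pods request versus what they actually use is enough to surface that kind of waste. This rough sketch assumes metrics-server is installed in the cluster:

```sh
# Actual usage right now (requires metrics-server)
kubectl top pods --all-namespaces

# What each pod asked for
kubectl get pods --all-namespaces -o \
  custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'
```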
You also need to solicit feedback. Ask your devs what's working and what's not. Talk to them so you understand what problems they are running into regularly. One of the biggest metrics around the success of a platform is how fast your developers can go. A big piece of making sure that devs can self-service is asking the developers how equipped they feel to do their job. Here are a few more questions you should be asking:
It’s incredibly important, both formally and informally, to have those lines of communication open with your development teams. Make sure they have a place to surface problems when they do happen, that they're able to do their job, and that they're not spending 50% of their time wrestling with infrastructure. Soliciting feedback is the only way to get that information.
These five tips will help you improve your dev teams’ ability to self-service in Kubernetes environments, and so much of it is about enabling communication. Whether you’re assessing the situation, shifting findings earlier in the SDLC, setting up guardrails, or delivering or soliciting feedback, it’s all about getting the right information to the right people at the right time.
And remember, this is an ongoing process. You can’t set up a self-service platform in Q4 and be set forever. You need to iterate constantly because your needs are changing constantly. Teams are changing as you hire and lose people. You also have new business requirements, such as new compliance standards, to adhere to. It is an ongoing process, one filled with constant improvement and iteration. Having open lines of communication across your teams will set you up for success with a self-service approach.