Stabilizing a Kubernetes cluster that pages
The cluster waking you up is rarely a Kubernetes problem. What actually stabilizes EKS in practice: resource discipline, node lifecycle, and GitOps.
The cluster that’s paging you is almost never broken in the way it feels broken. It feels like Kubernetes is flaky. It’s usually that the workloads on it were never given the information Kubernetes needs to schedule them well, and the cluster is doing exactly what you told it to, which happens to be the wrong thing. I’ve run EKS at Postscript on a platform serving tens of millions of requests a day, at GoPro across transcode infrastructure, and I migrated business-critical microservices from ECS to EKS at CVS/Aetna. The clusters that stabilized all got the same handful of things fixed, in roughly the same order.
Resource requests and limits are the whole ballgame
Most cluster instability traces back to pods that lie about what they need. A pod with no resource requests tells the scheduler it needs nothing, so the scheduler packs nodes until they tip over. A pod with no memory limit can grow until the node runs out of memory, at which point the kernel’s OOM killer and the kubelet start killing and evicting its neighbors. The symptom is a cluster that’s mysteriously unstable under load; the cause is that nobody set honest numbers.
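To make the “needs nothing” failure concrete, here is a toy placement loop in Python. It is a deliberately simplified sketch, not the real kube-scheduler: the function names, the single 4000 MiB node, and the pod numbers are all made up for illustration. The one thing it shares with the real scheduler is that it decides on declared requests, not on what the pods actually use.

```python
def schedule(pods, node_allocatable_mib):
    """Place pods on a single node for as long as the sum of their
    *requests* fits. Actual usage is never consulted, which is the point."""
    placed, requested = [], 0
    for pod in pods:
        if requested + pod["request_mib"] <= node_allocatable_mib:
            placed.append(pod)
            requested += pod["request_mib"]
    return placed


def actual_usage_mib(placed):
    """Memory the placed pods really consume, regardless of what they declared."""
    return sum(pod["actual_mib"] for pod in placed)


NODE_MIB = 4000  # hypothetical node size

# Ten identical pods that each really use 600 MiB.
lying_pods = [{"request_mib": 0, "actual_mib": 600} for _ in range(10)]
honest_pods = [{"request_mib": 600, "actual_mib": 600} for _ in range(10)]

for label, pods in (("no requests", lying_pods), ("honest requests", honest_pods)):
    placed = schedule(pods, NODE_MIB)
    print(label, "placed:", len(placed), "actual MiB:", actual_usage_mib(placed))
```

With no requests, all ten pods land on the node and it is asked to hold 6000 MiB in a 4000 MiB box. With honest requests, six pods are placed and the other four stay pending for another node.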
Fixing this is unglamorous. Measure what workloads actually use, set requests to the real baseline and limits to a real ceiling, and the scheduler suddenly makes good decisions because you’ve stopped feeding it bad data. This one change resolves more “Kubernetes is unreliable” complaints than any other, and it requires no new tooling, just the willingness to look at consumption and write down what you see.
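As a rough sketch of “measure, then write down what you see”, the Python below turns a handful of observed memory samples into a suggested request and limit. The policy is my assumption for illustration, not a standard: the request is the median sample and the limit is the observed peak plus 25% headroom. The function names and the sample values are hypothetical; in practice the samples would come from whatever metrics system you already run.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_memory(samples_mib, request_pct=50, limit_headroom=1.25):
    """Suggest a memory request and limit (in MiB) from observed usage.

    Assumed policy: request = the chosen percentile of usage (median by
    default), limit = observed peak times a headroom factor.
    """
    request = percentile(samples_mib, request_pct)
    limit = math.ceil(max(samples_mib) * limit_headroom)
    return {"request_mib": request, "limit_mib": limit}


# Hypothetical samples for one workload, in MiB.
usage = [410, 395, 430, 520, 405, 415, 610, 400]
print(suggest_memory(usage))  # {'request_mib': 410, 'limit_mib': 763}
```

The exact percentile and headroom matter less than the fact that both numbers come from measurement. A bursty workload may want a higher request percentile; the parameters are there so that choice is explicit.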
Stop managing nodes by hand
The second source of instability is node lifecycle done manually. Hand-built node groups drift, patch late, and turn every upgrade into an event. The fix is to make nodes disposable and automatic. I run Karpenter for provisioning, so the cluster adds and removes nodes based on the pods actually pending rather than a static group somebody sized months ago, and it consolidates underutilized nodes on its own, which quietly takes cost out at the same time.
For the node OS, Bottlerocket does the work a general-purpose Linux image shouldn’t be asked to do on a cluster. It is a minimal, image-based OS built for running containers, updated as an atomic image rather than patched in place. Together with Karpenter, that means nodes stop being pets you nurse and become cattle the cluster replaces without anyone noticing, which is the actual definition of a stable node fleet.
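The difference between sizing nodes from pending pods and sizing them from a guess made months ago can be shown with a small bin-packing sketch in Python. This is not Karpenter’s algorithm, which also weighs instance types, prices, and scheduling constraints; it is a plain first-fit-decreasing pack over memory requests, with a hypothetical function name and made-up numbers.

```python
def nodes_needed(pending_requests_mib, node_capacity_mib):
    """Pack pending pod memory requests onto as few equal-sized nodes as
    first-fit-decreasing manages. Returns one list of requests per node."""
    nodes = []
    for req in sorted(pending_requests_mib, reverse=True):
        if req > node_capacity_mib:
            raise ValueError(f"a {req} MiB pod cannot fit on any node")
        for node in nodes:
            if sum(node) + req <= node_capacity_mib:
                node.append(req)  # fits on a node we already plan to launch
                break
        else:
            nodes.append([req])  # nothing had room, so plan one more node
    return nodes


# Hypothetical pending pods and node size, in MiB.
pending = [2000, 1500, 1000, 1000, 500, 3000]
plan = nodes_needed(pending, node_capacity_mib=4000)
print(len(plan), plan)  # 3 [[3000, 1000], [2000, 1500, 500], [1000]]
```

The node count falls out of the requests the pods declare, which is one more reason those requests have to be honest. Consolidation is the same idea run in reverse: repack what is running and retire the nodes that are no longer needed.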
If it isn’t in Git, you don’t know what’s running
The third fix is GitOps, and it’s the one that makes the other two stick. A cluster that’s changed by whoever ran the last kubectl apply is a cluster whose real state nobody knows, and you can’t stabilize a system you can’t see. I run Argo CD so the cluster’s desired state lives in Git and a controller reconciles reality to match it continuously. Drift becomes visible instead of mysterious, a bad change is a revert instead of an archaeology project, and the question “what’s actually deployed right now” has an answer you can read.
This is the same discipline I’d apply to any infrastructure: the console is for reading and the pipeline is for changing. Kubernetes punishes its absence faster than most systems because so much can change so quickly. Get the cluster declarative first if it isn’t, because every fix after this one lands on solid ground instead of a moving target.
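The reconcile loop is easier to reason about once you see how small the core idea is. The Python below is a toy illustration of “compare desired state to live state, report the difference, converge on Git”. It is not how Argo CD is implemented, and the flat dictionaries, field names, and function names are all hypothetical stand-ins for real manifests.

```python
def find_drift(desired, live):
    """Return {field: (desired_value, live_value)} for every field that
    differs, including fields present on only one side (shown as None)."""
    drift = {}
    for key in sorted(set(desired) | set(live)):
        if desired.get(key) != live.get(key):
            drift[key] = (desired.get(key), live.get(key))
    return drift


def reconcile(desired, live):
    """Converge on the declared state. In this toy version, fields that
    exist only in the live state are dropped, which is a policy choice."""
    return dict(desired)


# What Git says versus what someone left behind with a manual change.
desired = {"image": "api:1.4.2", "replicas": 3}
live = {"image": "api:1.4.2", "replicas": 5, "debug": "true"}

print(find_drift(desired, live))  # {'debug': (None, 'true'), 'replicas': (3, 5)}
live = reconcile(desired, live)
print(find_drift(desired, live))  # {}
```

The first print is the part that matters for stability: the manual scale-up and the stray debug flag show up as a named diff instead of as an unexplained behavior change.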
Scale on signals that mean something
Once the cluster is declarative and the nodes take care of themselves, workload autoscaling is worth doing properly. The Horizontal Pod Autoscaler on CPU is the default, and it’s often the wrong signal; a queue consumer doesn’t get busy in a way CPU reflects. KEDA scales workloads on the metric that actually represents load (queue depth, stream lag, whatever the real backpressure signal is), so the system grows and shrinks with demand instead of with a proxy for it. Paired with Karpenter underneath, the cluster right-sizes itself at both layers, and the 2am page for “we couldn’t scale fast enough” mostly stops happening.
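Scaling on backlog instead of CPU comes down to simple arithmetic: divide the work waiting by the amount one replica should own, round up, and clamp to sane bounds. The Python sketch below shows that calculation. The function name, the target of 100 messages per replica, and the bounds are assumptions for illustration; a real setup would express the same target in the autoscaler’s own configuration rather than in application code.

```python
import math


def replicas_for_backlog(queue_depth, target_per_replica,
                         min_replicas=1, max_replicas=20):
    """Replica count that keeps each consumer near its target backlog,
    clamped to [min_replicas, max_replicas]."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical target: each consumer should own about 100 queued messages.
for depth in (0, 950, 5000):
    print(depth, "->", replicas_for_backlog(depth, target_per_replica=100))
# 0 -> 1
# 950 -> 10
# 5000 -> 20
```

A consumer blocked on I/O can sit at low CPU while the queue grows, so a CPU-based rule would hold it at the minimum. The backlog-based rule asks for ten replicas at a depth of 950 because that is what the work requires.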
Then upgrades stop being scary
A cluster with honest resource numbers, automated disposable nodes, GitOps reconciliation, and real autoscaling has a property the unstable version doesn’t: you can upgrade it without holding your breath. Nodes roll because they’re already disposable, workloads reschedule because their requests are honest, and the whole thing is declared in Git so you can see exactly what changed. The upgrade playbook is its own topic, and one I’ll write up next, because a routine EKS version bump is the clearest proof that a cluster is actually stable, and a terrifying one is the clearest proof it isn’t.
Questions this raises
- Why does my Kubernetes cluster keep having problems?
- Usually because of how the workloads are configured, not the cluster itself. The most common causes are missing or wrong resource requests and limits, which cause OOMKills and noisy-neighbor contention; node management done by hand, which drifts; and configuration applied straight to the cluster instead of through GitOps, so nobody knows what's actually running. Kubernetes is rarely the thing that's broken. The discipline around it usually is.
- What's the first thing to fix on an unstable EKS cluster?
- Get the cluster's desired state into Git and reconciled by a GitOps tool like Argo CD, so what's running is what's declared and drift becomes visible. Until you have that, every other fix is applied to a moving target you can't see. Once the cluster is declarative, resource requests and limits are usually the next fix, and the one that pays off most.
- Do I need Karpenter and KEDA, or are they overkill?
- For a cluster with variable load, they earn their place. Karpenter handles node provisioning based on actual pending pods, which beats static node groups for both cost and responsiveness. KEDA scales workloads on real signals like queue depth instead of just CPU. On a small, steady cluster you may not need either, but the instability that makes people ask this question is usually the kind those two tools address.