Kubernetes clusters routinely waste 40-70% of provisioned resources. Here is the complete playbook for cutting cloud spend without touching your SLOs.
The Kubernetes bill arrives. It is double what was budgeted. Someone asks the engineering team what changed. Nothing specific changed — the cluster just grew because it was easier to over-provision than to tune. This is the default state of most Kubernetes clusters older than eighteen months.
Studies consistently show that 40-70% of provisioned Kubernetes resources go unused. For a company spending ₹50 lakh per month on cloud infrastructure, that is ₹20-35 lakh sitting idle every thirty days.
Kubernetes does not manage cost. It manages availability. When in doubt, Kubernetes does the safe thing: it keeps the resource alive, even if nothing is using it.
The three biggest cost drivers in a typical cluster:
Over-provisioned requests are the clearest case: a team sets requests.memory: 2Gi and requests.cpu: 1000m out of caution, while actual usage is 200Mi and 80m. The scheduler reserves the full requested amount, and nodes fill up with ghost capacity.

Each of these has a specific fix, and none of them requires touching your production SLOs.
You cannot optimize what you cannot measure. Start with resource usage visibility.
```bash
# kubectl top reads from metrics-server, which must be running in the cluster
kubectl top nodes                  # node-level CPU and memory usage
kubectl top pods --all-namespaces  # pod-level usage

# Find the biggest resource consumers
kubectl top pods -A --sort-by=memory | head -20
```

For a proper view, deploy Goldilocks. It watches actual usage and recommends correct requests and limits:
```bash
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace

# Label a namespace to enable recommendations
kubectl label namespace production \
  goldilocks.fairwinds.com/enabled=true
```

Open the Goldilocks dashboard and you will see a table showing every deployment's actual CPU/memory usage against its requested values, along with an auto-generated recommendation for what the requests should be.
This single step is typically worth 30-40% cost reduction for teams that have never done it.
A pod with oversized requests blocks scheduler capacity even while idle. The node reports "full" to the scheduler when it is actually at 20% real utilization.
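To make the ghost-capacity effect concrete, here is a minimal sketch of the arithmetic. The request and usage figures are the ones from the example earlier in the article; the node size (4 vCPU, 16Gi allocatable) and the function name are assumptions chosen for illustration.

```python
def pods_that_fit(node_cpu_m: int, node_mem_mi: int, cpu_m: int, mem_mi: int) -> int:
    """Number of identical pods that fit on a node, limited by the scarcer resource."""
    return min(node_cpu_m // cpu_m, node_mem_mi // mem_mi)

NODE_CPU_M, NODE_MEM_MI = 4000, 16 * 1024  # assumed node: 4 vCPU, 16Gi allocatable
REQ_CPU_M, REQ_MEM_MI = 1000, 2 * 1024     # requests from the example: 1000m, 2Gi
USE_CPU_M, USE_MEM_MI = 80, 200            # actual usage from the example: 80m, 200Mi

# The scheduler packs by requests, so the node is "full" after this many pods.
scheduled = pods_that_fit(NODE_CPU_M, NODE_MEM_MI, REQ_CPU_M, REQ_MEM_MI)

# Real CPU utilization of that "full" node.
real_cpu_util = scheduled * USE_CPU_M / NODE_CPU_M

# Upper bound if requests matched usage exactly (no headroom, so not a target).
fit_by_usage = pods_that_fit(NODE_CPU_M, NODE_MEM_MI, USE_CPU_M, USE_MEM_MI)

print(f"pods scheduled by requests: {scheduled}")
print(f"real CPU utilization: {real_cpu_util:.0%}")
print(f"pods that would fit by usage alone: {fit_by_usage}")
```

Under these assumptions the node stops accepting pods while most of its CPU is idle. Real requests need headroom above observed usage, so the last number is a ceiling rather than a goal.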
Here is the pattern: use VPA (Vertical Pod Autoscaler) in recommendation mode to generate correct values, then apply them to your manifests.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"  # recommendation only, no auto-apply
```

```bash
# View VPA recommendations after 24 hours of data
kubectl describe vpa payment-service-vpa -n production
```

Apply the recommended values to your Deployment. Then do this for every service. It is tedious, and it is also the highest-ROI work you will do all quarter.
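When repeating this across many services, it helps to rank them by how far requests sit above the recommendation. The sketch below parses the quantity formats used in this article (millicores, and Ki/Mi/Gi memory suffixes) and computes an over-provisioning factor. The recommended values are hypothetical, and the helper names are illustrative rather than part of any Kubernetes client library.

```python
def cpu_millicores(quantity: str) -> float:
    """Parse a CPU quantity such as '80m' or '1' into millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000

# Only the binary suffixes used in this article are handled.
_MEMORY_UNITS_IN_MIB = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}

def memory_mib(quantity: str) -> float:
    """Parse a memory quantity with a Ki, Mi, or Gi suffix into MiB."""
    return float(quantity[:-2]) * _MEMORY_UNITS_IN_MIB[quantity[-2:]]

current = {"cpu": "1000m", "memory": "2Gi"}
recommended = {"cpu": "100m", "memory": "256Mi"}  # hypothetical VPA targets

cpu_factor = cpu_millicores(current["cpu"]) / cpu_millicores(recommended["cpu"])
mem_factor = memory_mib(current["memory"]) / memory_mib(recommended["memory"])

print(f"CPU request is {cpu_factor:.1f}x the recommendation")
print(f"Memory request is {mem_factor:.1f}x the recommendation")
```

Sorting services by these factors gives a reasonable order in which to work through the manifests, starting with the largest gap.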
Static node groups are a cost trap. A node group provisioned for 100 pods during a Swiggy dinner-rush peak sits at 12 pods at 3 AM.
Cluster Autoscaler (CA) scales your node groups based on pending pods and removes underutilized nodes automatically.
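"Underutilized" here is measured against pod requests, not live usage. The sketch below assumes the rule as described in the Cluster Autoscaler FAQ: a node becomes a removal candidate when both its requested CPU and requested memory are below the threshold share of allocatable, which is the same as checking the larger of the two ratios. The function names and node figures are illustrative.

```python
def node_utilization(cpu_requested_m: int, cpu_allocatable_m: int,
                     mem_requested_mi: int, mem_allocatable_mi: int) -> float:
    """Requested share of the node, taking the larger of CPU and memory."""
    return max(cpu_requested_m / cpu_allocatable_m,
               mem_requested_mi / mem_allocatable_mi)

def is_scale_down_candidate(utilization: float, threshold: float = 0.5) -> bool:
    """True when the node's requested share is below the threshold."""
    return utilization < threshold

# Hypothetical nodes with 4 vCPU and 16Gi allocatable.
quiet_node = node_utilization(1200, 4000, 4096, 16384)   # 30% CPU, 25% memory
busy_node = node_utilization(1200, 4000, 12288, 16384)   # 30% CPU, 75% memory

print(is_scale_down_candidate(quiet_node))
print(is_scale_down_candidate(busy_node))
```

Because the calculation uses requests, oversized requests also keep nodes from being removed: a node full of idle pods with inflated requests never drops below the threshold.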
```yaml
# Cluster Autoscaler key configuration
autoDiscovery:
  clusterName: prod-cluster
extraArgs:
  scale-down-utilization-threshold: "0.5"  # nodes with requests below 50% of allocatable become removal candidates
  scale-down-delay-after-add: "10m"        # wait before scaling down
  skip-nodes-with-local-storage: "false"
  balance-similar-node-groups: "true"
```

If you are on AWS, consider adding Karpenter (many teams run it in place of Cluster Autoscaler rather than managing the same nodes with both). Karpenter picks right-sized instances from the available instance types instead of scaling pre-defined node groups, which eliminates the "I need 4 CPUs but the only node type available is 8" waste.
```bash
# Install Karpenter (AWS)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --version "${KARPENTER_VERSION}"
```

Your staging cluster does not need to run at 2 AM on a Sunday. Dev environments do not need to run at all during weekends.
KEDA (Kubernetes Event-Driven Autoscaling) can scale deployments to zero on a schedule and scale them back up before business hours:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: staging-api-scaledown
  namespace: staging
spec:
  scaleTargetRef:
    name: api-service
  minReplicaCount: 0  # allows scale-to-zero
  maxReplicaCount: 5
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Kolkata
        start: "30 8 * * 1-5"   # scale up 8:30 AM weekdays
        end: "0 21 * * 1-5"     # scale down 9 PM weekdays
        desiredReplicas: "3"
```

For dev namespaces, go further: use kube-downscaler to automatically scale everything to zero outside working hours across entire namespaces, not just individual deployments.
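A quick way to judge a schedule is to work out how much of the week the workload is actually up. This sketch uses the 8:30 AM to 9 PM weekday window from the cron trigger above; the helper name is illustrative.

```python
from datetime import date, datetime, time

def hours_between(start: time, end: time) -> float:
    """Length of a same-day window in hours."""
    delta = datetime.combine(date.min, end) - datetime.combine(date.min, start)
    return delta.total_seconds() / 3600

up_hours_per_day = hours_between(time(8, 30), time(21, 0))
up_hours_per_week = up_hours_per_day * 5           # weekdays only
off_fraction = 1 - up_hours_per_week / (7 * 24)    # share of the week at zero replicas

print(f"up {up_hours_per_week} of {7 * 24} hours per week")
print(f"scaled to zero {off_fraction:.0%} of the time")
```

This is replica-hours for the scheduled workload, not a share of the total bill. The time at zero only turns into savings once the nodes those pods ran on are removed, which is why the schedule works best together with node autoscaling.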
Spot (AWS) / Preemptible (GCP) / Spot (Azure) instances cost 60-90% less than on-demand. For stateless workloads — API servers, workers, batch jobs — they are a direct cost lever.
The key is a correct pod disruption budget and fast restart behavior:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2  # always keep 2 pods alive during node drain
  selector:
    matchLabels:
      app: api-service
```

Pair this with a node taint for spot nodes so only workloads that tolerate interruption get scheduled there:
```yaml
tolerations:
  - key: "node.kubernetes.io/spot"
    operator: "Exists"
    effect: "NoSchedule"
```

Never run stateful workloads (databases, Redis with persistence, Kafka) on spot nodes.
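A PodDisruptionBudget with minAvailable only leaves room for evictions when the workload runs more healthy replicas than the budget requires. A minimal sketch of that arithmetic, using minAvailable: 2 from the budget above (the function name is illustrative):

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions permitted right now under a minAvailable budget."""
    return max(0, healthy_pods - min_available)

for replicas in (2, 3, 5):
    print(f"{replicas} healthy replicas -> "
          f"{allowed_disruptions(replicas, min_available=2)} evictions allowed")
```

With exactly two replicas the budget allows no evictions, so a node drain waits on those pods. Running at least one replica more than minAvailable keeps drains moving. Note that a budget governs voluntary evictions such as drains; it cannot hold back a spot reclamation itself.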
Wire everything together in Grafana with a cost dashboard that shows spend by namespace, by team, and by workload. OpenCost is the open-source standard for this:
```bash
# Add the OpenCost chart repository before installing
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost \
  opencost/opencost \
  --namespace opencost \
  --create-namespace
```

OpenCost breaks down your cloud bill by Kubernetes namespace, label, and deployment, so when engineering leadership asks "which team is spending the most," you have a real answer in thirty seconds.
Don't optimize and change SLOs at the same time. Run two weeks of VPA recommendations in observation mode before applying anything to production. Compare actual resource usage at peak load (dinner time for Swiggy, market open for Zerodha, festive sale launch for Flipkart) to the VPA recommendation before trusting it.
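One way to frame the peak-load comparison is as a simple check: does the recommended request still cover the highest usage observed during a peak window, with some margin? The sketch below is illustrative only. The 15% headroom, the sample values, and the function name are assumptions, not figures from VPA or from this article.

```python
def covers_peak(recommended_m: int, peak_samples_m: list[int], headroom: float = 0.15) -> bool:
    """True if the recommendation covers the observed peak plus a safety margin."""
    peak = max(peak_samples_m)
    return recommended_m >= peak * (1 + headroom)

# Hypothetical CPU usage samples (millicores) captured during a peak window.
dinner_rush_samples = [60, 85, 140]

print(covers_peak(100, dinner_rush_samples))
print(covers_peak(200, dinner_rush_samples))
```

A recommendation that fails this check was probably built from data that did not include a real peak, which is the case the two-week observation window is meant to catch.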
Set LimitRange objects in every namespace to prevent new deployments from landing without resource requests:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        memory: 512Mi
        cpu: 500m
      defaultRequest:
        memory: 128Mi
        cpu: 100m
      type: Container
```

This ensures that even if a developer forgets to set requests, the namespace defaults kick in, preventing ghost capacity from accumulating silently.
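The defaulting rules have one subtlety worth knowing. The sketch below models them as described in the Kubernetes LimitRange documentation, using the values from the manifest above: a container that sets a limit but no request gets a request equal to its limit, not the defaultRequest. It is a simplified model (validation of min, max, and request-above-limit cases is not covered), and the function name is illustrative.

```python
DEFAULT_LIMITS = {"cpu": "500m", "memory": "512Mi"}     # LimitRange "default"
DEFAULT_REQUESTS = {"cpu": "100m", "memory": "128Mi"}   # LimitRange "defaultRequest"

def apply_limit_range(resources: dict) -> dict:
    """Return the container's resources after LimitRange defaulting."""
    requests = dict(resources.get("requests", {}))
    limits = dict(resources.get("limits", {}))
    for name in ("cpu", "memory"):
        explicit_limit = name in limits
        if not explicit_limit:
            limits[name] = DEFAULT_LIMITS[name]
        if name not in requests:
            # An explicit limit without a request makes the request match the limit.
            requests[name] = limits[name] if explicit_limit else DEFAULT_REQUESTS[name]
    return {"requests": requests, "limits": limits}

print(apply_limit_range({}))                          # nothing set by the developer
print(apply_limit_range({"limits": {"cpu": "1"}}))    # only a CPU limit set
```

The second case is the one to watch: a developer who sets only a generous limit ends up with an equally generous request, which recreates the over-provisioning the LimitRange was meant to prevent.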
| Technique | Savings Potential | Risk Level | Effort |
|---|---|---|---|
| Fix resource requests | 30-40% | Low | Medium |
| Scale non-prod to zero | 15-25% | Very low | Low |
| Cluster Autoscaler | 10-20% | Low | Low |
| Spot instances | 40-70% | Medium | Medium |
| Karpenter | 20-35% | Low | High |
Start with fixing resource requests and scaling non-prod to zero. These two alone typically justify the time investment in the first week.
**References & Further Reading**

* [Goldilocks Documentation](https://goldilocks.docs.fairwinds.com/) - VPA-based resource recommendations
* [OpenCost](https://www.opencost.io/) - Open-source Kubernetes cost monitoring
* [KEDA Documentation](https://keda.sh/docs/) - Event-driven and scheduled autoscaling
* [Karpenter](https://karpenter.sh/) - AWS node provisioning with right-sizing
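Scale-down is blocked by pods with local storage (unless skip-nodes-with-local-storage is set to "false"), pods whose PodDisruptionBudget does not allow another eviction, pods that are not backed by a controller, kube-system pods that have no PodDisruptionBudget, and pods carrying the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation. DaemonSet pods do not block scale-down. Audit blocking pods per node using kubectl describe node and check the CA logs for 'not removable' reasons before tuning scale-down thresholds.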
VPA in Auto mode and HPA scaling on CPU/memory create a feedback loop — VPA adjusts resource requests which changes the HPA's utilization denominator, triggering erratic HPA scaling. The supported pattern is to use VPA in Off or Initial mode to set correct baseline requests, and HPA to handle replica scaling using custom metrics like request rate rather than CPU percentage.
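The feedback loop is easier to see with numbers. The sketch below uses the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), where utilization is usage divided by request. The workload figures are hypothetical, and the HPA's tolerance band and stabilization window are left out for simplicity.

```python
def desired_replicas(current_replicas: int, usage_m: int, request_m: int, target_pct: int) -> int:
    """HPA replica count for a CPU utilization target, using exact integer math."""
    numerator = current_replicas * usage_m * 100
    denominator = request_m * target_pct
    return -(-numerator // denominator)  # ceiling division

# Same load in both cases: 4 replicas, each using 300m CPU, target 60% utilization.
before_vpa = desired_replicas(4, usage_m=300, request_m=1000, target_pct=60)
after_vpa = desired_replicas(4, usage_m=300, request_m=400, target_pct=60)

print(f"request 1000m -> HPA wants {before_vpa} replicas")
print(f"request 400m  -> HPA wants {after_vpa} replicas")
```

Nothing about the traffic changed between the two calls. Only the request moved, yet the HPA's answer swings from scaling in to scaling out, which is why scaling on a metric such as request rate avoids the interaction.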