Kubernetes Cost Optimization: Cutting Cloud Spend Without Breaking SLOs

Kubernetes clusters routinely waste 40-70% of provisioned resources. Here is the complete playbook for cutting cloud spend without touching your SLOs.

Status: DRAFT

The Kubernetes bill arrives. It is double what was budgeted. Someone asks the engineering team what changed. Nothing specific changed — the cluster just grew because it was easier to over-provision than to tune. This is the default state of most Kubernetes clusters older than eighteen months.

Studies consistently show that 40-70% of provisioned Kubernetes resources go unused. For a company spending ₹50 lakh per month on cloud infrastructure, that is ₹20-35 lakh sitting idle every thirty days.

Why Kubernetes Clusters Overspend by Default

Kubernetes does not manage cost. It manages availability. When in doubt, Kubernetes does the safe thing: it keeps the resource alive, even if nothing is using it.

The three biggest cost drivers in a typical cluster:

Oversized resource requests: Developers set requests.memory: 2Gi and requests.cpu: 1000m out of caution. The actual usage is 200Mi and 80m. The scheduler reserves the full requested amount and nodes fill up with ghost capacity.
Idle workloads: Dev and staging environments run 24/7 even though nobody uses them between 10 PM and 9 AM.
Underutilized node groups: Fixed-size node groups provisioned for peak load sit at 15% utilization for 20 hours a day.

Each of these has a specific fix, and none of them require touching your production SLOs.
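
To make the first driver concrete, here is a minimal Python sketch of how requests, rather than usage, decide how many pods fit on a node. The node size (16 GiB and 4 cores allocatable) and the tuned request values are assumptions chosen for illustration; the oversized request and usage figures come from the example above.

```python
# Sketch: the scheduler packs pods by their requests, not by what they use.
# NODE_* values are hypothetical; the oversized pod figures mirror the text.

NODE_MEMORY_MI = 16 * 1024   # assumed allocatable memory: 16 GiB
NODE_CPU_M = 4000            # assumed allocatable CPU: 4 cores


def pods_that_fit(mem_mi: int, cpu_m: int) -> int:
    """Pods schedulable on one node given per-pod memory (Mi) and CPU (m) requests."""
    return min(NODE_MEMORY_MI // mem_mi, NODE_CPU_M // cpu_m)


def real_utilization(pods: int, used_mem_mi: int, used_cpu_m: int) -> tuple:
    """Fraction of node memory and CPU actually consumed by `pods` pods."""
    return (pods * used_mem_mi / NODE_MEMORY_MI, pods * used_cpu_m / NODE_CPU_M)


oversized = pods_that_fit(mem_mi=2048, cpu_m=1000)   # requests from the text
right_sized = pods_that_fit(mem_mi=256, cpu_m=100)   # hypothetical tuned requests
mem_util, cpu_util = real_utilization(oversized, used_mem_mi=200, used_cpu_m=80)

print(f"pods per node with oversized requests: {oversized}")
print(f"pods per node with tuned requests:     {right_sized}")
print(f"real utilization when 'full': memory {mem_util:.0%}, cpu {cpu_util:.0%}")
```

Under these assumptions the node is "full" after four pods while using well under a tenth of its CPU and memory.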

Step 1: See What You Are Actually Using

You cannot optimize what you cannot measure. Start with resource usage visibility.

Bash

## kubectl top needs metrics-server running in the cluster
kubectl top nodes  ## node-level CPU and memory usage
kubectl top pods --all-namespaces  ## pod-level usage
 
## Find biggest resource consumers
kubectl top pods -A --sort-by=memory | head -20

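For a proper view, deploy Goldilocks. It creates a VPA in recommendation mode for each workload in a labeled namespace and surfaces the suggested requests and limits, so the Vertical Pod Autoscaler must be installed in the cluster first: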

Bash

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace
 
## Label a namespace to enable recommendations
kubectl label namespace production \
  goldilocks.fairwinds.com/enabled=true

Open the Goldilocks dashboard and you will see a table showing every deployment's actual CPU/memory usage vs its requested values — and an auto-generated recommendation for what the requests should actually be.

Step 2: Fix Resource Requests

This single step is typically worth 30-40% cost reduction for teams that have never done it.

A pod with oversized requests blocks scheduler capacity even while idle. The scheduler sums requests, not usage, so it treats a node as "full" when real utilization is only around 20%.

Here is the pattern: use VPA (Vertical Pod Autoscaler) in recommendation mode to generate correct values, then apply them to your manifests.

YAML

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"  ## recommendation only, no auto-apply

Bash

## View VPA recommendations after 24 hours of data
kubectl describe vpa payment-service-vpa -n production

Apply the recommended values to your Deployment. Then do this for every service. It is tedious — it is also the highest-ROI work you will do all quarter.
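
Because the work is repetitive, it helps to decide which services to fix first. The sketch below ranks services by how much requested CPU they would give back. The service names and numbers are made up for illustration; in practice the `target_cpu_m` values would be copied from each VPA's recommendation.

```python
# Sketch: rank services by reclaimable CPU requests (all data is hypothetical).

services = [
    {"name": "payment-service", "replicas": 6, "request_cpu_m": 1000, "target_cpu_m": 150},
    {"name": "search-api", "replicas": 10, "request_cpu_m": 500, "target_cpu_m": 300},
    {"name": "notifier", "replicas": 2, "request_cpu_m": 250, "target_cpu_m": 200},
]


def reclaimable_cpu_m(svc: dict) -> int:
    """Millicores of requested CPU freed if the service adopts the target."""
    return (svc["request_cpu_m"] - svc["target_cpu_m"]) * svc["replicas"]


# Largest reclaimable amount first: these are the manifests to edit first.
ranked = sorted(services, key=reclaimable_cpu_m, reverse=True)
for svc in ranked:
    print(f'{svc["name"]:<16} frees {reclaimable_cpu_m(svc)} m')
```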

Step 3: Right-Size Your Node Groups with Cluster Autoscaler

Static node groups are a cost trap. A node group provisioned for 100 pods during a Swiggy dinner-rush peak sits at 12 pods at 3 AM.

Cluster Autoscaler (CA) scales your node groups based on pending pods and removes underutilized nodes automatically.

YAML

## Cluster Autoscaler key configuration
autoDiscovery:
  clusterName: prod-cluster
 
extraArgs:
  scale-down-utilization-threshold: "0.5"  ## consider removing nodes whose pod requests total below 50% of allocatable
  scale-down-delay-after-add: "10m"        ## wait before scaling down
  skip-nodes-with-local-storage: "false"
  balance-similar-node-groups: "true"
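
One detail worth making explicit: the utilization compared against `scale-down-utilization-threshold` is based on the sum of pod requests relative to the node's allocatable resources, not on live usage. Below is a minimal Python sketch of that check with hypothetical node and pod figures. It assumes the node qualifies only when both the CPU and the memory ratio are under the threshold, and it leaves out the autoscaler's other conditions, such as whether the pods can be rescheduled elsewhere.

```python
# Sketch of the scale-down utilization check. Assumption: a node qualifies
# only if both CPU and memory request ratios are below the threshold, which
# is the same as comparing the larger of the two ratios.

def node_utilization(pod_requests, allocatable_cpu_m, allocatable_mem_mi):
    """pod_requests: list of (cpu_m, mem_mi) request pairs for pods on the node."""
    cpu = sum(c for c, _ in pod_requests) / allocatable_cpu_m
    mem = sum(m for _, m in pod_requests) / allocatable_mem_mi
    return max(cpu, mem)


def is_scale_down_candidate(utilization, threshold=0.5):
    """True when the node's requested utilization is under the threshold."""
    return utilization < threshold


pods = [(500, 1024), (250, 512), (250, 2048)]   # hypothetical pods on one node
util = node_utilization(pods, allocatable_cpu_m=4000, allocatable_mem_mi=16384)
print(f"utilization={util:.2f}, candidate={is_scale_down_candidate(util)}")
```

This is also why Step 2 matters here: inflated requests keep this ratio high, so nodes that are nearly idle in real terms never become scale-down candidates.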

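If you are on AWS, consider Karpenter, either in place of Cluster Autoscaler or managing separate capacity from it, since the two should not manage the same nodes. Karpenter picks an instance type to fit the pending pods rather than scaling a pre-defined node type, which reduces the "I need 4 CPUs but the only node type available is 8" waste.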

Bash

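## Install Karpenter (AWS). KARPENTER_VERSION must be exported first;
## a real install also needs the cluster-specific settings from the Karpenter docs.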
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --create-namespace \
  --version "${KARPENTER_VERSION}"

Step 4: Scale Non-Production to Zero

Your staging cluster does not need to run at 2 AM on a Sunday. Dev environments do not need to run at all during weekends.

KEDA (Kubernetes Event-Driven Autoscaling) can scale deployments to zero on a schedule and scale them back up before business hours:

YAML

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: staging-api-scaledown
  namespace: staging
spec:
  scaleTargetRef:
    name: api-service
  minReplicaCount: 0  ## allows scale-to-zero
  maxReplicaCount: 5
  triggers:
    - type: cron
      metadata:
        timezone: Asia/Kolkata
        start: "30 8 * * 1-5"   ## scale up 8:30 AM weekdays
        end: "0 21 * * 1-5"     ## scale down 9 PM weekdays
        desiredReplicas: "3"
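
A quick way to estimate what this schedule saves is to count the hours the window is open. The sketch below does that arithmetic for the 8:30 AM to 9 PM weekday window above. It assumes cost scales linearly with replica-hours and ignores any nodes that stay up for other workloads.

```python
# Sketch: fraction of the week a weekday-only cron window keeps replicas running.

HOURS_PER_WEEK = 7 * 24


def weekly_on_hours(start_hour: float, end_hour: float, days: int) -> float:
    """Hours per week inside a daily window that repeats on `days` days."""
    return (end_hour - start_hour) * days


on_hours = weekly_on_hours(start_hour=8.5, end_hour=21.0, days=5)
off_fraction = 1 - on_hours / HOURS_PER_WEEK

print(f"running {on_hours} h/week, scaled to zero {off_fraction:.0%} of the time")
```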

For dev namespaces, go further: use kube-downscaler to automatically scale everything to zero outside working hours across entire namespaces, not just individual deployments.

Step 5: Use Spot Instances for Stateless Workloads

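Spot (AWS) / Spot or Preemptible (GCP) / Spot (Azure) instances can cost up to 60-90% less than on-demand, in exchange for the provider being able to reclaim them at short notice. For stateless workloads such as API servers, workers, and batch jobs, they are a direct cost lever.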
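
The per-instance discount is not the same as the saving on the whole bill, because only part of the fleet can move to spot. A minimal sketch of the blended figure, using a hypothetical spot share and discount:

```python
# Sketch: blended compute cost when part of the fleet moves to spot capacity.

def blended_cost(on_demand_cost: float, spot_share: float, spot_discount: float) -> float:
    """Cost after moving `spot_share` of spend to spot at `spot_discount` off."""
    if not (0 <= spot_share <= 1 and 0 <= spot_discount <= 1):
        raise ValueError("spot_share and spot_discount must be between 0 and 1")
    return on_demand_cost * (1 - spot_share * spot_discount)


# Hypothetical: 100 units of on-demand spend, 60% moved to spot at 70% off.
cost = blended_cost(100.0, spot_share=0.6, spot_discount=0.7)
print(f"blended cost: {cost:.1f} (saves {100.0 - cost:.1f})")
```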

The key is a correct pod disruption budget and fast restart behavior:

YAML

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2  ## always keep 2 pods alive during node drain
  selector:
    matchLabels:
      app: api-service
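
How much room this budget leaves depends on the replica count. Here is a minimal sketch of the arithmetic for a `minAvailable` budget; the function name is hypothetical.

```python
# Sketch: voluntary evictions a minAvailable PDB permits at a given moment.

def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that can be evicted without dropping below min_available."""
    return max(0, healthy_pods - min_available)


for replicas in (2, 3, 5):
    evictable = allowed_disruptions(replicas, min_available=2)
    print(f"{replicas} healthy pods -> {evictable} evictable")
```

With exactly two replicas, `minAvailable: 2` permits no evictions at all, so a node drain will wait on those pods. Run at least three replicas if drains need to make progress.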

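Pair this with a taint on your spot nodes so only workloads that tolerate interruption get scheduled there. The pod side of that pairing is a matching toleration; the key below is an example and must match whatever taint your spot node group actually applies: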

YAML

tolerations:
  - key: "node.kubernetes.io/spot"
    operator: "Exists"
    effect: "NoSchedule"

Never run stateful workloads (databases, Redis with persistence, Kafka) on spot nodes.

The Dashboard You Need

Wire everything together in Grafana with a cost dashboard that shows spend by namespace, by team, and by workload. OpenCost is the open-source standard for this:

Bash

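## OpenCost reads its metrics from Prometheus, which must already be running
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update
helm install opencost \
  opencost/opencost \
  --namespace opencost \
  --create-namespace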

OpenCost breaks down your cloud bill by Kubernetes namespace, label, and deployment — so when engineering leadership asks "which team is spending the most," you have a real answer in thirty seconds.

Production Implementation Guidelines

Don't optimize and change SLOs at the same time. Run two weeks of VPA recommendations in observation mode before applying anything to production. Compare actual resource usage at peak load (dinner time for Swiggy, market open for Zerodha, festive sale launch for Flipkart) to the VPA recommendation before trusting it.

Set LimitRange objects in every namespace to prevent new deployments from landing without resource requests:

YAML

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        memory: 512Mi
        cpu: 500m
      defaultRequest:
        memory: 128Mi
        cpu: 100m
      type: Container

This ensures that even if a developer forgets to set requests, the namespace defaults kick in — preventing ghost capacity from accumulating silently.
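
The defaulting has one wrinkle worth knowing: a container that sets a limit but no request gets a request equal to its own limit, not the namespace `defaultRequest`. The sketch below models that behavior for a single container. The default values mirror the LimitRange above, while the function is illustrative and skips validation such as rejecting a request larger than the limit.

```python
# Sketch of LimitRange defaulting for one container, resource by resource.
# Default values mirror the LimitRange above; the function is illustrative.

DEFAULT_LIMITS = {"memory": "512Mi", "cpu": "500m"}
DEFAULT_REQUESTS = {"memory": "128Mi", "cpu": "100m"}


def apply_defaults(requests: dict, limits: dict) -> tuple:
    """Return (requests, limits) after filling in missing values."""
    out_req, out_lim = dict(requests), dict(limits)
    for resource in DEFAULT_LIMITS:
        has_req, has_lim = resource in requests, resource in limits
        if not has_lim:
            out_lim[resource] = DEFAULT_LIMITS[resource]
        if not has_req:
            # Only a limit was set: the request follows that limit,
            # not the namespace defaultRequest.
            out_req[resource] = limits[resource] if has_lim else DEFAULT_REQUESTS[resource]
    return out_req, out_lim


print(apply_defaults({}, {}))                  # nothing set: both defaults apply
print(apply_defaults({}, {"memory": "1Gi"}))   # memory request becomes 1Gi
```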

Trade-offs and Alternatives

Technique	Savings Potential	Risk Level	Effort
Fix resource requests	30-40%	Low	Medium
Scale non-prod to zero	15-25%	Very low	Low
Cluster Autoscaler	10-20%	Low	Low
Spot instances	40-70%	Medium	Medium
Karpenter	20-35%	Low	High

Start with fixing resource requests and scaling non-prod to zero. These two alone typically justify the time investment in the first week.

References & Further Reading

* [Goldilocks Documentation](https://goldilocks.docs.fairwinds.com/) - VPA-based resource recommendations
* [OpenCost](https://www.opencost.io/) - Open-source Kubernetes cost monitoring
* [KEDA Documentation](https://keda.sh/docs/) - Event-driven and scheduled autoscaling
* [Karpenter](https://karpenter.sh/) - AWS node provisioning with right-sizing

Frequently Asked Questions

Why does Cluster Autoscaler fail to scale down nodes even when pod utilization is below the threshold?

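Scale-down is blocked by pods that use local storage (unless skip-nodes-with-local-storage is set to false), kube-system pods that have no PodDisruptionBudget, pods whose PodDisruptionBudget leaves no room for eviction, pods not managed by a controller, or pods with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation. DaemonSet pods do not block scale-down by themselves. Audit blocking pods per node using kubectl describe node and check the CA logs for the reasons a node is reported as not removable before tuning scale-down thresholds.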

How do VPA and HPA conflict when both are applied to the same Deployment in Kubernetes?

VPA in Auto mode and HPA scaling on CPU/memory create a feedback loop — VPA adjusts resource requests which changes the HPA's utilization denominator, triggering erratic HPA scaling. The supported pattern is to use VPA in Off or Initial mode to set correct baseline requests, and HPA to handle replica scaling using custom metrics like request rate rather than CPU percentage.
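
The feedback loop follows from the HPA formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), where utilization is usage divided by the request. Here is a minimal sketch with hypothetical numbers, ignoring the HPA's tolerance band and stabilization window:

```python
import math

# Sketch: same real usage, different request, different HPA decision.


def desired_replicas(current_replicas, usage_cpu_m, request_cpu_m, target_utilization):
    """HPA replica formula for a CPU utilization target (tolerance ignored)."""
    utilization = usage_cpu_m / request_cpu_m
    return math.ceil(current_replicas * utilization / target_utilization)


# Hypothetical: 4 replicas each using 400m CPU, HPA target of 60% utilization.
before = desired_replicas(4, usage_cpu_m=400, request_cpu_m=1000, target_utilization=0.6)
# VPA lowers the request to 500m; usage is unchanged but utilization doubles.
after = desired_replicas(4, usage_cpu_m=400, request_cpu_m=500, target_utilization=0.6)

print(f"replicas wanted before VPA change: {before}, after: {after}")
```

Nothing about the traffic changed between the two calls, yet the HPA's answer did, which is why the two autoscalers should not both act on CPU or memory for the same workload.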
