Overview and What You Will Learn
In this guide you will set up a full Kubernetes monitoring stack using Prometheus, Grafana, and Alertmanager. You will learn how Prometheus collects metrics from pods and nodes, how to build Grafana dashboards that show the health of your cluster, and how to configure alerts that notify your team on Slack when something goes wrong. This is the same stack used by SRE teams at Zerodha, Razorpay, and Swiggy.
Why This Matters in Production
Without monitoring, you find out your cluster is struggling when users start complaining, not before. Prometheus and Grafana give you a real-time view of every pod's CPU usage, memory pressure, request rates, and error rates. When an incident happens at 3am on a Swiggy deploy, the Grafana dashboard is the first thing every engineer opens. Without it, debugging takes hours instead of minutes.
Core Principles
```
+------------------------------------------+
| Your Application Pods                    |
| expose /metrics endpoint                 |  <- Prometheus scrapes these
+------------------------------------------+
                     |
+------------------------------------------+
| Node Exporter DaemonSet                  |  <- Collects node CPU/memory/disk
| (one pod per node)                       |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Prometheus                               |  <- Stores all metrics in time-series DB
| Scrapes every target every 15s           |     Runs PromQL queries
+------------------------------------------+
         |                       |
         v                       v
+------------------+    +------------------+
| Grafana          |    | Alertmanager     |
| Dashboards       |    | Routes alerts    |
| PromQL charts    |    | to Slack/PagerD  |
+------------------+    +------------------+
```

Detailed Step-by-Step Practical Lab
Step 1: Install the Full Stack with kube-prometheus-stack Helm Chart
The easiest way to get Prometheus, Grafana, Alertmanager, and all required exporters running is with the community kube-prometheus-stack Helm chart. It installs everything pre-configured in one command.
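The install command in this step passes its settings as `--set` flags. As a sketch of an alternative, the same settings can be kept in a values file, which keeps the Grafana password out of your shell history. The file name `values.yaml`, the password placeholder, and the `accessModes` entry are assumptions; the remaining keys mirror the flags used in this step.

```yaml
# values.yaml (hypothetical file name)
grafana:
  adminPassword: "<your-password>"   # placeholder, replace before installing

prometheus:
  prometheusSpec:
    retention: 15d                   # keep 15 days of metrics
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]   # assumption: single-writer volume
          resources:
            requests:
              storage: 50Gi
```

With such a file, the install would become `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring -f values.yaml`.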
```bash
# Add the Helm repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

# Create a dedicated monitoring namespace
kubectl create namespace monitoring

# Install the full stack
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=Mumbai@2024 \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Verify all pods are running
kubectl get pods -n monitoring
# prometheus-kube-prometheus-stack-prometheus-0       2/2   Running
# alertmanager-kube-prometheus-stack-alertmanager-0   2/2   Running
# kube-prometheus-stack-grafana-6d8f9b-xkp2q          3/3   Running
# kube-prometheus-stack-kube-state-metrics-xxx        1/1   Running
# kube-prometheus-stack-node-exporter-<node>          2/2   Running (one per node)
```

Step 2: Access Grafana and Explore Pre-Built Dashboards
```bash
# Port forward Grafana to your local machine
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

# Open http://localhost:3000 in your browser
# Username: admin
# Password: Mumbai@2024 (set during install)

# The install includes pre-built dashboards:
# Dashboard ID 315   - Kubernetes Cluster Overview
# Dashboard ID 6417  - Kubernetes Pods
# Dashboard ID 1860  - Node Exporter Full (node-level metrics)
# Dashboard ID 13332 - Kubernetes Resource Requests

# Import additional dashboards
# Grafana UI -> Dashboards -> Import -> Enter dashboard ID
```

Step 3: Make Your Application Expose Metrics for Prometheus
For Prometheus to scrape your application, it needs a /metrics endpoint in Prometheus format. Then you create a ServiceMonitor object that tells Prometheus where to scrape.
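A ServiceMonitor does not point at pods directly: it selects a Service by label and refers to one of that Service's ports by name. As a minimal sketch, the Service for the `order-api` example used in this step could look like the following. The pod label and the container port `3000` are assumptions; only the `app: order-api` label and the port name `http` are required by the ServiceMonitor in this step.

```yaml
# service.yaml (hypothetical): the Service that the ServiceMonitor selects
apiVersion: v1
kind: Service
metadata:
  name: order-api
  namespace: production
  labels:
    app: order-api          # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: order-api          # assumption: label on the application pods
  ports:
    - name: http            # referenced by the ServiceMonitor's "port: http"
      port: 80
      targetPort: 3000      # assumption: port the Node.js app listens on
```

If the port has no name, or the name differs from the one in the ServiceMonitor, the target will not appear in Prometheus.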
```javascript
// Add a metrics endpoint to your Node.js app using the prom-client library.
// Assumes an Express app; reuse your existing `app` if you already have one.
const express = require('express')
const client = require('prom-client')

const app = express()
const register = new client.Registry()
client.collectDefaultMetrics({ register })

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.send(await register.metrics())
})
```

```yaml
# servicemonitor.yaml: tells Prometheus to scrape your app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-api-metrics
  namespace: production
  labels:
    release: kube-prometheus-stack   # Must match the Prometheus selector
spec:
  selector:
    matchLabels:
      app: order-api                 # Selects your Service
  endpoints:
    - port: http                     # Named port on your Service
      path: /metrics                 # Metrics endpoint path
      interval: 15s                  # Scrape every 15 seconds
  namespaceSelector:
    matchNames:
      - production
```

Step 4: Writing Useful PromQL Queries
PromQL is the query language for Prometheus. These are the most useful queries for Kubernetes monitoring:
```promql
# CPU usage per pod over the last 5 minutes, as a percentage of one core
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])) * 100

# Memory usage per pod in MiB
sum by (pod) (container_memory_usage_bytes{namespace="production", container!=""}) / 1024 / 1024

# Pod restart count: pods restarting frequently are unstable
kube_pod_container_status_restarts_total{namespace="production"} > 5

# HTTP error rate: percentage of 5xx responses
# (both sides are summed so the division is not matched series by series)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Node memory pressure (% of memory in use)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Number of pending pods: indicates scheduling problems
sum(kube_pod_status_phase{phase="Pending"})
```

Step 5: Configure Alertmanager to Send Slack Notifications
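Alertmanager only routes alerts that Prometheus has already fired, so the Slack routing in this step depends on alert rules that carry a `severity` label. A minimal sketch of such rules as a PrometheusRule object follows. The object name, alert names, thresholds, and `for` durations are illustrative assumptions. The `release` label is assumed to match the chart's rule selector in the same way it does for ServiceMonitors, and `http_requests_total` with a `status` label is assumed to be exported by your application.

```yaml
# prometheus-rules.yaml (hypothetical file and rule names)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-pod-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: matches the Prometheus rule selector
spec:
  groups:
    - name: production.pods
      rules:
        - alert: PodRestartingFrequently
          # More than 5 restarts of one container within the last hour
          expr: increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 5
          for: 5m
          labels:
            severity: warning        # routed to the warnings receiver
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
            description: "Container {{ $labels.container }} restarted more than 5 times in the last hour."
        - alert: HighHttpErrorRate
          # More than 5% of requests returning 5xx (threshold is illustrative)
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) * 100 > 5
          for: 10m
          labels:
            severity: critical       # routed to the production receiver
          annotations:
            summary: "HTTP 5xx error rate is above 5%"
            description: "The share of 5xx responses has been above 5% for 10 minutes."
```

The `summary` and `description` annotations are the fields the Slack message template reads from each alert.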
```yaml
# alertmanager-config.yaml: send critical alerts to Slack
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-production'
      routes:
        - match:
            severity: critical
          receiver: 'slack-production'
        - match:
            severity: warning
          receiver: 'slack-warnings'
    receivers:
      - name: 'slack-production'
        slack_configs:
          - channel: '#prod-alerts'
            title: 'CRITICAL: {{ .GroupLabels.alertname }}'
            text: |
              Namespace: {{ .GroupLabels.namespace }}
              {{ range .Alerts }}
              Summary: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}
            send_resolved: true
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#prod-warnings'
            title: 'WARNING: {{ .GroupLabels.alertname }}'
            send_resolved: true
```

```bash
# Apply the Alertmanager config
kubectl apply -f alertmanager-config.yaml -n monitoring

# Verify Alertmanager loaded the config
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
# Open http://localhost:9093 to see Alertmanager status and active alerts
```

Production Best Practices & Common Pitfalls
- Set Prometheus storage retention to at least 15 days and use a persistent volume (not emptyDir). If Prometheus restarts with emptyDir storage, all historical metrics are lost: your dashboards show gaps and incident post-mortems become impossible.
- Create a dedicated Grafana dashboard for each team namespace showing their pods' CPU, memory, and HTTP error rates. At Swiggy, each squad has their own dashboard linked from their runbook. This dramatically reduces MTTR during incidents because engineers go straight to their service metrics.

**Common Mistake:** Scraping every container at 5-second intervals in a large cluster. With 500 pods, a 5-second scrape interval generates enormous write load on Prometheus and can cause it to fall behind on ingestion. Use 15 seconds as the default and only drop to 5 seconds for critical latency-sensitive metrics.
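One way to follow this advice is to leave the cluster-wide default alone and set the shorter interval only on the ServiceMonitor of the one service that needs it. The sketch below assumes a hypothetical latency-sensitive `payment-api` Service with a named `http` port; every other target keeps its own (longer) interval.

```yaml
# servicemonitor-payment-api.yaml (hypothetical service and file name)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-api-metrics
  namespace: production
  labels:
    release: kube-prometheus-stack   # must match the Prometheus selector
spec:
  selector:
    matchLabels:
      app: payment-api               # assumption: label on the Service
  endpoints:
    - port: http
      path: /metrics
      interval: 5s                   # shorter interval for this target only
  namespaceSelector:
    matchNames:
      - production
```

Scoping the override to a single ServiceMonitor keeps the extra write load proportional to one service rather than to all 500 pods.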
Quick Reference & Troubleshooting Commands
| Command | Purpose |
|---|---|
| `kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring` | Access Grafana locally |
| `kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring` | Access Prometheus UI |
| `kubectl get servicemonitor -n production` | List ServiceMonitors that Prometheus is scraping |
| `kubectl get prometheusrule -n monitoring` | List all alert rules |
| `kubectl logs prometheus-kube-prometheus-stack-prometheus-0 -n monitoring` | View Prometheus logs |
Scaling Kubernetes Nodes Automatically with Cluster Autoscaler on AWS EKS
Overview and What You Will Learn
In this guide you will set up the Kubernetes Cluster Autoscaler (CA) on AWS EKS. You will learn the difference between HPA (scaling pods) and CA (scaling nodes), how to configure CA to add nodes when pods are stuck in Pending, how to set safe scale-down policies that do not disrupt running workloads, and how to combine CA with Spot instances to reduce cloud costs by 40-70%.
Why This Matters in Production
HPA scales your pods when traffic increases, but if your nodes do not have enough capacity to schedule those new pods, they stay stuck in Pending forever. Cluster Autoscaler is what adds the actual nodes. At Hotstar during an IPL match when traffic spikes 20x in minutes, CA is what automatically provisions new EC2 nodes to handle the pods that HPA created, without any human intervention.
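To make the pod-scaling half of this picture concrete, here is a minimal sketch of an HPA for a hypothetical `order-api` Deployment. The names, replica bounds, and the 70% CPU target are illustrative assumptions. The HPA only changes the replica count; whether the extra pods find a node is the Cluster Autoscaler's job.

```yaml
# hpa.yaml (hypothetical names and thresholds)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70% of requests
```

CPU utilization here is measured against the pods' CPU requests, so the target Deployment needs resource requests set for both the HPA and the scheduler to work as described.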
Core Principles
```
+------------------------------------------+
| Traffic spike -> HPA adds new pods       |  <- Pod scaling: HPA handles this
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| New pods stuck in Pending                |  <- No node has free capacity
| Not enough node resources                |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Cluster Autoscaler detects Pending pods  |  <- Node scaling: CA handles this
| Calculates which node group to expand    |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| CA requests new EC2 node from AWS        |  <- Calls EC2 Auto Scaling Group API
| Node joins cluster in ~2 minutes         |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Pending pods scheduled on new node       |  <- Traffic served successfully
+------------------------------------------+
```

Detailed Step-by-Step Practical Lab
Step 1: Create EKS Node Group with Auto Scaling
```bash
# Create a managed node group with Auto Scaling boundaries
aws eks create-nodegroup \
  --cluster-name mumbai-prod-cluster \
  --nodegroup-name production-nodes \
  --node-role arn:aws:iam::905418385260:role/eks-node-role \
  --subnets subnet-0a1b2c3d subnet-0e5f6g7h \
  --instance-types t3.large t3.xlarge \
  --ami-type AL2_x86_64 \
  --scaling-config minSize=3,maxSize=20,desiredSize=3

# Add the required auto-discovery tags to the node group
# CA uses these tags to find which node groups it can scale
aws autoscaling create-or-update-tags \
  --tags \
  ResourceId=eks-production-nodes-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true \
  ResourceId=eks-production-nodes-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/mumbai-prod-cluster,Value=owned,PropagateAtLaunch=true
```

Step 2: Create IAM Role for Cluster Autoscaler
CA needs permission to describe and modify EC2 Auto Scaling Groups:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeImages",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions",
        "eks:DescribeNodegroup"
      ],
      "Resource": "*"
    }
  ]
}
```

```bash
# Create the IAM role and attach the policy (using IRSA, IAM Roles for Service Accounts)
eksctl create iamserviceaccount \
  --cluster=mumbai-prod-cluster \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=arn:aws:iam::905418385260:policy/ClusterAutoscalerPolicy \
  --override-existing-serviceaccounts \
  --approve \
  --region=ap-south-1
```

Step 3: Deploy Cluster Autoscaler
```yaml
# cluster-autoscaler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # The IRSA service account
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
          name: cluster-autoscaler
          resources:
            requests:
              cpu: 100m
              memory: 600Mi
            limits:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste                   # Pick the node group that leaves the least idle CPU/memory after scale-up
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled=true,k8s.io/cluster-autoscaler/mumbai-prod-cluster=owned
            - --balance-similar-node-groups            # Keep node groups balanced
            - --skip-nodes-with-system-pods=false
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m         # Wait 10 min after scale-up before scaling down
            - --scale-down-unneeded-time=10m           # Node must be unneeded for 10 min before removal
            - --scale-down-utilization-threshold=0.5   # Remove node if below 50% utilization
```

```bash
kubectl apply -f cluster-autoscaler-deployment.yaml

# Verify CA is running and connected to AWS
kubectl logs deployment/cluster-autoscaler -n kube-system | grep -i "successfully"
```

Step 4: Test Cluster Autoscaler is Working
```bash
# Create a deployment that needs more resources than currently available
kubectl create deployment scale-test \
  --image=nginx \
  --replicas=50 \
  -n production

# Give each pod a resource request so the scheduler actually runs out of room.
# CA reacts to unschedulable pods, and pods without requests rarely become unschedulable.
# (The request values are illustrative.)
kubectl set resources deployment scale-test \
  --requests=cpu=500m,memory=512Mi \
  -n production

# Watch pods: some will be Pending because nodes are full
kubectl get pods -n production | grep Pending

# Watch CA logs to see it detecting Pending pods and requesting new nodes
kubectl logs -f deployment/cluster-autoscaler -n kube-system | grep -i "scale up"

# Watch new nodes joining the cluster
kubectl get nodes -w
# After 2-3 minutes new nodes appear with STATUS: Ready

# Clean up after test
kubectl delete deployment scale-test -n production
# After 10 minutes CA will remove the extra nodes (scale-down-unneeded-time)
```

Step 5: Protect Critical Pods from Scale-Down
CA removes idle nodes by evicting pods first. Use these annotations to protect critical workloads:
```yaml
# Prevent CA from evicting this pod during scale-down
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
# Use for: Prometheus, stateful workloads, anything that takes >5 min to restart
```

```yaml
# Or use a PodDisruptionBudget to ensure minimum replicas during scale-down
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
  namespace: production
spec:
  minAvailable: 3   # CA will not remove a node if it would drop below 3 pods
  selector:
    matchLabels:
      app: payment-api
```

Production Best Practices & Common Pitfalls
- Set `scale-down-delay-after-add` to at least 10 minutes. Without this delay, CA may remove a node that was just added 2 minutes ago because traffic momentarily dropped, causing a thrashing cycle of add/remove that wastes money and destabilises the cluster.
- Use multiple instance types in your node group (e.g., `t3.large`, `t3.xlarge`, `m5.large`). If AWS has limited capacity in the AZ for one instance type, CA can provision a different one. Single instance type node groups get stuck when AWS runs out of that specific instance in your availability zone.
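As a sketch of how a multi-instance-type node group could be combined with Spot capacity, the following eksctl config declares a separate Spot node group next to the on-demand one. The node group name and size bounds are assumptions, and the sketch assumes the cluster can be managed with eksctl. The instance types are deliberately limited to ones with the same size (2 vCPU, 8 GiB), because CA estimates a node group's capacity from a single instance shape.

```yaml
# spot-nodegroup.yaml (hypothetical file and node group name)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mumbai-prod-cluster
  region: ap-south-1
managedNodeGroups:
  - name: production-spot-nodes
    spot: true                                           # request Spot capacity
    instanceTypes: ["t3.large", "m5.large", "m5a.large"]  # all 2 vCPU / 8 GiB
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
```

It could then be created with `eksctl create nodegroup --config-file=spot-nodegroup.yaml`. Keeping `minSize` at 0 means Spot nodes exist only while CA needs them; critical workloads can stay on the on-demand group.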
**Common Mistake:** Running CA without PodDisruptionBudgets on stateful workloads. When CA removes a node, it evicts all pods on it at roughly the same time. If the replicas of a 3-replica StatefulSet share the nodes being removed and there is no PDB, CA can evict all 3 pods at once, causing a full service outage during what should be a routine scale-down. Always pair CA with PDBs on any workload where availability matters.
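For a StatefulSet, a PDB expressed as `maxUnavailable` is often easier to reason about than `minAvailable`, because it stays correct when the replica count changes. A minimal sketch for a hypothetical 3-replica `orders-db` StatefulSet (the name and label are assumptions):

```yaml
# orders-db-pdb.yaml (hypothetical workload)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-db-pdb
  namespace: production
spec:
  maxUnavailable: 1        # at most one replica may be evicted at a time
  selector:
    matchLabels:
      app: orders-db       # assumption: label on the StatefulSet pods
```

With this budget in place, an eviction that would take a second replica down is refused until the first one is healthy again, so a scale-down drains such nodes one replica at a time or skips them.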
Quick Reference & Troubleshooting Commands
| Command | Purpose |
|---|---|
| `kubectl logs deployment/cluster-autoscaler -n kube-system` | View CA activity and decisions |
| `kubectl get nodes -w` | Watch nodes being added and removed |
| `kubectl describe pod <pending-pod>` | See why a pod is Pending (insufficient resources) |
| `kubectl get events -n kube-system \| grep cluster-autoscaler` | CA events |
| `kubectl annotate pod <name> cluster-autoscaler.kubernetes.io/safe-to-evict=false` | Protect a pod from CA eviction |