What is the career path for learning Monitoring Kubernetes with Prometheus and Grafana?

Mastering Monitoring Kubernetes with Prometheus and Grafana opens up engineering roles in DevOps, SRE, and cloud platform automation.

How long does it take to learn Monitoring Kubernetes with Prometheus and Grafana?

Most students gain core proficiency in Monitoring Kubernetes with Prometheus and Grafana in 2–3 weeks of active hands-on labs.

Monitoring Kubernetes with Prometheus and Grafana | DevOps Network

Overview and What You Will Learn

In this guide you will set up a full Kubernetes monitoring stack using Prometheus, Grafana, and Alertmanager. You will learn how Prometheus collects metrics from pods and nodes, how to build Grafana dashboards that show the health of your cluster, and how to configure alerts that notify your team on Slack when something goes wrong. This is the same stack used by SRE teams at Zerodha, Razorpay, and Swiggy.

Why This Matters in Production

Without monitoring, you find out your cluster is struggling when users start complaining — not before. Prometheus and Grafana give you a real-time view of every pod's CPU usage, memory pressure, request rates, and error rates. When an incident happens at 3am on a Swiggy deploy, the Grafana dashboard is the first thing every engineer opens. Without it, debugging takes hours instead of minutes.

Core Principles

◈ DIAGRAM

+------------------------------------------+
| Your Application Pods                    |
| expose /metrics endpoint                 | <- Prometheus scrapes these
+------------------------------------------+
                    |
+------------------------------------------+
| Node Exporter DaemonSet                  | <- Collects node CPU/memory/disk
| (one pod per node)                       |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Prometheus                               | <- Stores all metrics in time-series DB
| Scrapes every target every 15s           |    Runs PromQL queries
+------------------------------------------+
          |                   |
          v                   v
+------------------+  +------------------+
| Grafana          |  | Alertmanager     |
| Dashboards       |  | Routes alerts    |
| PromQL charts    |  | to Slack/PagerD  |
+------------------+  +------------------+

Detailed Step-by-Step Practical Lab

Step 1: Install the Full Stack with kube-prometheus-stack Helm Chart

The easiest way to get Prometheus, Grafana, Alertmanager, and all required exporters running is with the community kube-prometheus-stack Helm chart. It installs everything pre-configured in one command.

Bash

# Add the Helm repository
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
 
# Create a dedicated monitoring namespace
kubectl create namespace monitoring
 
# Install the full stack
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=Mumbai@2024 \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
 
# Verify all pods are running
kubectl get pods -n monitoring
# prometheus-kube-prometheus-stack-prometheus-0   2/2  Running
# alertmanager-kube-prometheus-stack-alertmanager-0  2/2  Running
# kube-prometheus-stack-grafana-6d8f9b-xkp2q  3/3  Running
# kube-prometheus-stack-kube-state-metrics-xxx 1/1  Running
# kube-prometheus-stack-prometheus-node-exporter-xxxxx 1/1  Running (one per node)

Step 2: Access Grafana and Explore Pre-Built Dashboards

Bash

# Port forward Grafana to your local machine
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
 
# Open http://localhost:3000 in your browser
# Username: admin
# Password: Mumbai@2024 (set during install)
 
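# The chart ships with its own pre-built Kubernetes dashboards (Grafana UI -> Dashboards).
# Popular community dashboards from grafana.com that you can import by ID: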
# Dashboard ID 315  — Kubernetes Cluster Overview
# Dashboard ID 6417 — Kubernetes Pods
# Dashboard ID 1860 — Node Exporter Full (node-level metrics)
# Dashboard ID 13332 — Kubernetes Resource Requests
 
# Import additional dashboards
# Grafana UI -> Dashboards -> Import -> Enter dashboard ID

Step 3: Make Your Application Expose Metrics for Prometheus

For Prometheus to scrape your application, it needs a /metrics endpoint in Prometheus format. Then you create a ServiceMonitor object that tells Prometheus where to scrape.

YAML

# Add metrics endpoint to your Node.js app using prom-client library
# In your app code:
# const client = require('prom-client')
# const register = new client.Registry()
# client.collectDefaultMetrics({ register })
# app.get('/metrics', async (req, res) => {
#   res.set('Content-Type', register.contentType)
#   res.send(await register.metrics())
# })
 
# servicemonitor.yaml — tells Prometheus to scrape your app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-api-metrics
  namespace: production
  labels:
    release: kube-prometheus-stack    # Must match the Prometheus selector
spec:
  selector:
    matchLabels:
      app: order-api                  # Selects your Service
  endpoints:
    - port: http                      # Named port on your Service
      path: /metrics                  # Metrics endpoint path
      interval: 15s                   # Scrape every 15 seconds
  namespaceSelector:
    matchNames:
      - production
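
To make "Prometheus format" concrete, here is a minimal Python sketch of the text a /metrics endpoint returns for one counter. It is an illustration only, not how prom-client is implemented: the helper `render_counter` and the metric name `http_requests_total` are hypothetical, and label-value escaping is left out.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        # A series without labels is written as the bare metric name.
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = render_counter(
        "http_requests_total",
        "Total HTTP requests handled.",
        {
            (("method", "GET"), ("status", "200")): 1027,
            (("method", "GET"), ("status", "500")): 3,
        },
    )
    print(body, end="")
```

Each line after the HELP and TYPE comments is one time series: the metric name, its labels, and the current value. Prometheus reads this text on every scrape and attaches its own timestamp.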

Step 4: Writing Useful PromQL Queries

PromQL is the query language for Prometheus. These are the most useful queries for Kubernetes monitoring:

PromQL

# CPU usage per pod over the last 5 minutes, as a percentage of one CPU core
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m])) * 100
 
# Memory usage per pod in MiB
sum by (pod) (container_memory_usage_bytes{namespace="production", container!=""}) / 1024 / 1024
 
# Containers that restarted more than 5 times in the last hour (unstable pods)
increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 5
 
# HTTP error rate: percentage of 5xx responses across all series
# (http_requests_total and its status label depend on your app's instrumentation)
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
 
# Node memory utilisation (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100
 
# Number of Pending pods; a sustained non-zero value indicates scheduling problems
sum(kube_pod_status_phase{phase="Pending"})
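
If rate() feels opaque, the following Python sketch approximates what it does with raw counter samples and how an error percentage is built from summed rates. It is a simplified model under stated assumptions: real PromQL also extrapolates to the window edges, and the function names and sample data are hypothetical.

```python
def per_second_rate(samples):
    """Approximate PromQL rate(): per-second increase of a counter.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (for example a pod restart).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def error_percentage(series_by_status):
    """Share of 5xx traffic: sum of 5xx rates over the sum of all rates, times 100."""
    total = sum(per_second_rate(s) for s in series_by_status.values())
    errors = sum(
        per_second_rate(s)
        for status, s in series_by_status.items()
        if status.startswith("5")
    )
    return errors / total * 100 if total > 0 else 0.0


if __name__ == "__main__":
    window = {
        "200": [(0, 0), (60, 294)],  # 4.9 requests/s
        "500": [(0, 0), (60, 6)],    # 0.1 requests/s
    }
    print(f"{error_percentage(window):.1f}% of requests failed")
```

Summing before dividing matters: the numerator and denominator carry different label sets, so dividing the raw series against each other would not produce a cluster-wide percentage.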

Step 5: Configure Alertmanager to Send Slack Notifications

YAML

# alertmanager-config.yaml — send critical alerts to Slack
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-production'
      routes:
        - match:
            severity: critical
          receiver: 'slack-production'
        - match:
            severity: warning
          receiver: 'slack-warnings'
 
    receivers:
      - name: 'slack-production'
        slack_configs:
          - channel: '#prod-alerts'
            title: 'CRITICAL: {{ .GroupLabels.alertname }}'
            text: |
              Namespace: {{ .GroupLabels.namespace }}
              {{ range .Alerts }}
              Summary: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              {{ end }}
            send_resolved: true
 
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#prod-warnings'
            title: 'WARNING: {{ .GroupLabels.alertname }}'
            send_resolved: true

Bash

# Apply the Alertmanager config
kubectl apply -f alertmanager-config.yaml -n monitoring
 
# Verify Alertmanager loaded the config
kubectl port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 -n monitoring
# Open http://localhost:9093 to see Alertmanager status and active alerts
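
The routing tree in the config above can be reasoned about with a small Python model. This is a sketch of the idea only, not Alertmanager's implementation: it ignores `continue`, regex matchers, and receiver inheritance, and the function names are hypothetical.

```python
def pick_receiver(alert_labels, route):
    """Return the receiver for an alert.

    The first child route whose match labels all equal the alert's labels wins;
    if no child matches, the current route's receiver is used.
    """
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(alert_labels.get(k) == v for k, v in wanted.items()):
            return pick_receiver(alert_labels, child)
    return route["receiver"]


def group_key(alert_labels, group_by):
    """Alerts with equal values for the group_by labels land in one notification."""
    return tuple(alert_labels.get(name, "") for name in group_by)


# Mirrors the route section of alertmanager-config.yaml.
ROUTE = {
    "receiver": "slack-production",
    "group_by": ["alertname", "namespace"],
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "slack-production"},
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}

if __name__ == "__main__":
    alert = {"alertname": "PodCrashLooping", "namespace": "production", "severity": "warning"}
    print(pick_receiver(alert, ROUTE), group_key(alert, ROUTE["group_by"]))
```

An alert with no severity label, or with a severity other than critical or warning, falls through to the top-level receiver, which is why the root route always needs one.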

Production Best Practices & Common Pitfalls

Set Prometheus storage retention to at least 15 days and use a persistent volume (not emptyDir). If Prometheus restarts with emptyDir storage, all historical metrics are lost — your dashboards show gaps and incident post-mortems become impossible.
Create a dedicated Grafana dashboard for each team namespace showing their pods' CPU, memory, and HTTP error rates. At Swiggy, each squad has their own dashboard linked from their runbook — this dramatically reduces MTTR during incidents because engineers go straight to their service metrics.

COMMON MISTAKE / WARNING
Scraping every container at 5-second intervals in a large cluster. With 500 pods, a 5-second scrape interval produces three times as many samples as a 15-second interval, which puts heavy write load on Prometheus and can cause it to fall behind on ingestion. Use 15 seconds as the default and only drop to 5 seconds for critical latency-sensitive metrics.
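
A quick back-of-the-envelope calculation shows why the scrape interval and retention matter together. The Python sketch below uses the sizing rule from the Prometheus storage documentation (retention seconds times ingested samples per second times bytes per sample, with roughly 1 to 2 bytes per sample). The figure of 200 series per pod is an assumption chosen for illustration; measure your own cluster before sizing a volume.

```python
def samples_per_second(targets, series_per_target, scrape_interval_s):
    """Ingestion rate: every series yields one sample per scrape."""
    return targets * series_per_target / scrape_interval_s


def needed_disk_gib(retention_days, ingest_rate, bytes_per_sample=2.0):
    """Disk estimate: retention seconds * samples per second * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingest_rate * bytes_per_sample / 1024**3


if __name__ == "__main__":
    for interval in (15, 5):
        rate = samples_per_second(targets=500, series_per_target=200,
                                  scrape_interval_s=interval)
        print(f"{interval}s interval: {rate:,.0f} samples/s, "
              f"{needed_disk_gib(15, rate):.1f} GiB for 15 days")
```

Under these assumptions the 15-second interval needs roughly 16 GiB for 15 days of retention, while the 5-second interval needs roughly 48 GiB, which would nearly fill the 50Gi volume requested in Step 1.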

Quick Reference & Troubleshooting Commands

Command	Purpose
`kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring`	Access Grafana locally
`kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring`	Access Prometheus UI
`kubectl get servicemonitor -n production`	List ServiceMonitors that Prometheus is scraping
`kubectl get prometheusrule -n monitoring`	List all alert rules
`kubectl logs prometheus-kube-prometheus-stack-prometheus-0 -n monitoring`	View Prometheus logs

Scaling Kubernetes Nodes Automatically with Cluster Autoscaler on AWS EKS

Overview and What You Will Learn

In this guide you will set up the Kubernetes Cluster Autoscaler (CA) on AWS EKS. You will learn the difference between HPA (scaling pods) and CA (scaling nodes), how to configure CA to add nodes when pods are stuck in Pending, how to set safe scale-down policies that do not disrupt running workloads, and how to combine CA with Spot instances to reduce cloud costs by 40-70%.

Why This Matters in Production

HPA scales your pods when traffic increases — but if your nodes do not have enough capacity to schedule those new pods, they stay stuck in Pending forever. Cluster Autoscaler is what adds the actual nodes. At Hotstar during an IPL match when traffic spikes 20x in minutes, CA is what automatically provisions new EC2 nodes to handle the pods that HPA created — without any human intervention.

Core Principles

◈ DIAGRAM

+------------------------------------------+
| Traffic spike — HPA adds new pods        | <- Pod scaling: HPA handles this
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| New pods stuck in Pending                | <- No node has free capacity
| Not enough node resources                |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Cluster Autoscaler detects Pending pods  | <- Node scaling: CA handles this
| Calculates which node group to expand    |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| CA requests new EC2 node from AWS        | <- Calls EC2 Auto Scaling Group API
| Node joins cluster in ~2 minutes         |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Pending pods scheduled on new node       | <- Traffic served successfully
+------------------------------------------+
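
The "calculates" step in the diagram can be pictured as a packing problem: how many new nodes are needed so that every Pending pod's requests fit? The Python sketch below uses a simple first-fit-decreasing heuristic. It is an illustration under stated assumptions, not CA's algorithm (CA simulates the real scheduler, including affinity and taints), and the allocatable figures are hypothetical.

```python
def nodes_needed(pending_pods, node_cpu_m, node_mem_mi):
    """Estimate how many new nodes the pending pods need.

    pending_pods is a list of (cpu_millicores, memory_mib) requests.
    node_cpu_m and node_mem_mi are the allocatable resources of one new node.
    """
    free = []  # remaining [cpu, memory] on each simulated new node
    for cpu, mem in sorted(pending_pods, reverse=True):
        if cpu > node_cpu_m or mem > node_mem_mi:
            raise ValueError("pod does not fit on an empty node of this type")
        for node in free:
            if node[0] >= cpu and node[1] >= mem:
                node[0] -= cpu
                node[1] -= mem
                break
        else:
            # No simulated node has room, so another node is required.
            free.append([node_cpu_m - cpu, node_mem_mi - mem])
    return len(free)


if __name__ == "__main__":
    pending = [(500, 512)] * 10  # ten pods requesting 500m CPU and 512Mi each
    print(nodes_needed(pending, node_cpu_m=1900, node_mem_mi=7000))
```

With these numbers three pods fit on each node (CPU is the limiting resource), so ten Pending pods call for four new nodes. Pods without resource requests contribute nothing to this calculation, which is why requests are essential for autoscaling to behave predictably.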

Detailed Step-by-Step Practical Lab

Step 1: Create EKS Node Group with Auto Scaling

Bash

# Create a managed node group with Auto Scaling boundaries
aws eks create-nodegroup \
  --cluster-name mumbai-prod-cluster \
  --nodegroup-name production-nodes \
  --node-role arn:aws:iam::905418385260:role/eks-node-role \
  --subnets subnet-0a1b2c3d subnet-0e5f6g7h \
  --instance-types t3.large m5.large \
  --ami-type AL2_x86_64 \
  --scaling-config minSize=3,maxSize=20,desiredSize=3
 
# Auto-discovery tags: CA uses these to find which node groups it may scale.
# EKS applies them automatically to the Auto Scaling group of a managed node group;
# add them yourself for self-managed groups. Replace eks-production-nodes-asg
# with the real Auto Scaling group name.
aws autoscaling create-or-update-tags \
  --tags \
    ResourceId=eks-production-nodes-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true \
    ResourceId=eks-production-nodes-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/mumbai-prod-cluster,Value=owned,PropagateAtLaunch=true
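
To see how tag-based discovery decides which Auto Scaling groups CA may touch, consider this Python sketch. It models the matching rule as understood from the `--node-group-auto-discovery` flag format (every listed tag key must be present, and the value must match when one is given). The function names are hypothetical and this is not CA's code.

```python
def parse_discovery_spec(spec):
    """Parse 'asg:tag=key1=value1,key2' into {key: value or None}."""
    prefix = "asg:tag="
    if not spec.startswith(prefix):
        raise ValueError("expected an 'asg:tag=' discovery spec")
    wanted = {}
    for item in spec[len(prefix):].split(","):
        key, sep, value = item.partition("=")
        wanted[key] = value if sep else None
    return wanted


def asg_is_discovered(asg_tags, wanted):
    """An ASG qualifies only if it carries every wanted tag (and value, when given)."""
    return all(
        key in asg_tags and (value is None or asg_tags[key] == value)
        for key, value in wanted.items()
    )


if __name__ == "__main__":
    wanted = parse_discovery_spec(
        "asg:tag=k8s.io/cluster-autoscaler/enabled=true,"
        "k8s.io/cluster-autoscaler/mumbai-prod-cluster=owned"
    )
    tags = {
        "k8s.io/cluster-autoscaler/enabled": "true",
        "k8s.io/cluster-autoscaler/mumbai-prod-cluster": "owned",
    }
    print(asg_is_discovered(tags, wanted))
```

The cluster-specific tag is what keeps one cluster's autoscaler from scaling another cluster's node groups in the same AWS account.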

Step 2: Create IAM Role for Cluster Autoscaler

CA needs permission to describe and modify EC2 Auto Scaling Groups:

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeImages",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions",
        "eks:DescribeNodegroup"
      ],
      "Resource": "*"
    }
  ]
}

Bash

# Create the IAM role and attach the policy (using IRSA — IAM Roles for Service Accounts)
eksctl create iamserviceaccount \
  --cluster=mumbai-prod-cluster \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=arn:aws:iam::905418385260:policy/ClusterAutoscalerPolicy \
  --override-existing-serviceaccounts \
  --approve \
  --region=ap-south-1

Step 3: Deploy Cluster Autoscaler

YAML

# cluster-autoscaler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # The IRSA service account
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
          name: cluster-autoscaler
          resources:
            requests:
              cpu: 100m
              memory: 600Mi
            limits:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste        # Pick the node group that leaves the least idle CPU/memory after scale-up
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled=true,k8s.io/cluster-autoscaler/mumbai-prod-cluster=owned
            - --balance-similar-node-groups  # Keep node groups balanced
            - --skip-nodes-with-system-pods=false
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m   # Wait 10 min after scale-up before scaling down
            - --scale-down-unneeded-time=10m     # Node must be unneeded for 10 min before removal
            - --scale-down-utilization-threshold=0.5  # Removal candidate if requested CPU and memory are below 50% of allocatable

Bash

kubectl apply -f cluster-autoscaler-deployment.yaml
 
# Verify CA is running and connected to AWS
kubectl logs deployment/cluster-autoscaler -n kube-system | grep -i "successfully"
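
The `--expander=least-waste` flag is easier to reason about with numbers. The Python sketch below scores each candidate node group by the share of newly added CPU and memory that would stay unrequested, then picks the lowest score. It is a simplified model, not CA's implementation, and the group names and capacities are hypothetical.

```python
def waste_score(pod_requests, node_count, node_cpu_m, node_mem_mi):
    """Fraction of newly added CPU plus memory that would stay unrequested."""
    need_cpu = sum(cpu for cpu, _ in pod_requests)
    need_mem = sum(mem for _, mem in pod_requests)
    total_cpu = node_count * node_cpu_m
    total_mem = node_count * node_mem_mi
    return (total_cpu - need_cpu) / total_cpu + (total_mem - need_mem) / total_mem


def least_waste(pod_requests, options):
    """options maps a node group name to (node_count, node_cpu_m, node_mem_mi)."""
    return min(options, key=lambda name: waste_score(pod_requests, *options[name]))


if __name__ == "__main__":
    pending = [(500, 1024)] * 6  # six pods requesting 500m CPU and 1Gi each
    options = {
        "group-a": (2, 1900, 7000),   # two smaller nodes would be added
        "group-b": (1, 3900, 15000),  # one larger node would be added
    }
    print(least_waste(pending, options))
```

Here the two smaller nodes leave less idle capacity than the single larger node, so the sketch selects group-a.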

Step 4: Test Cluster Autoscaler is Working

Bash

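# Create a deployment that requests more resources than currently available
kubectl create deployment scale-test \
  --image=nginx \
  --replicas=50 \
  -n production
 
# Give every pod a resource request; CA only reacts to pods that cannot be scheduled
kubectl set resources deployment scale-test \
  --requests=cpu=500m,memory=512Mi \
  -n production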
 
# Watch pods — some will be Pending because nodes are full
kubectl get pods -n production | grep Pending
 
# Watch CA logs to see it detecting Pending pods and requesting new nodes
kubectl logs -f deployment/cluster-autoscaler -n kube-system | grep -i "scale up"
 
# Watch new nodes joining the cluster
kubectl get nodes -w
# After 2-3 minutes new nodes appear with STATUS: Ready
 
# Clean up after test
kubectl delete deployment scale-test -n production
# After 10 minutes CA will remove the extra nodes (scale-down-unneeded-time)

Step 5: Protect Critical Pods from Scale-Down

CA removes idle nodes by evicting pods first. Use these annotations to protect critical workloads:

YAML

# Prevent CA from evicting this pod during scale-down
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    # Use for: Prometheus, stateful workloads, anything that takes >5 min to restart
 
# Or use a PodDisruptionBudget to ensure minimum replicas during scale-down
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
  namespace: production
spec:
  minAvailable: 3           # CA will not remove a node if it would drop below 3 pods
  selector:
    matchLabels:
      app: payment-api
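
The effect of `minAvailable` on a node drain comes down to simple arithmetic, sketched here in Python. This is a simplified model (real evictions are checked one pod at a time through the eviction API), and the function names are hypothetical.

```python
def disruptions_allowed(healthy_pods, min_available):
    """How many pods may be evicted right now without violating the budget."""
    return max(0, healthy_pods - min_available)


def can_drain_node(matching_pods_on_node, healthy_pods, min_available):
    """A node is drainable only if evicting its matching pods keeps the budget."""
    return matching_pods_on_node <= disruptions_allowed(healthy_pods, min_available)


if __name__ == "__main__":
    # payment-api runs 4 healthy replicas with minAvailable: 3
    print(can_drain_node(matching_pods_on_node=1, healthy_pods=4, min_available=3))
    print(can_drain_node(matching_pods_on_node=2, healthy_pods=4, min_available=3))
```

With exactly 3 replicas and `minAvailable: 3` no eviction is ever allowed, so CA could never remove a node that hosts one of those pods. Run at least one replica more than the budget requires.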

Production Best Practices & Common Pitfalls

Set scale-down-delay-after-add to at least 10 minutes. Without this delay, CA may remove a node that was just added 2 minutes ago because traffic momentarily dropped — causing a thrashing cycle of add/remove that wastes money and destabilises the cluster.
Use multiple instance types in your node group, but pick types with the same vCPU and memory (e.g., t3.large, t3a.large, m5.large), because CA assumes every node in a group has the same capacity. If AWS has limited capacity in the AZ for one instance type, the Auto Scaling group can launch a different one. Single instance type node groups get stuck when AWS runs out of that specific instance in your availability zone.
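
The three scale-down flags from Step 3 combine into a single decision, sketched here in Python. It is a simplified model of the documented behaviour (utilization is taken as the larger of the CPU and memory request ratios); the function names are hypothetical and real CA applies further checks such as PDBs and the safe-to-evict annotation.

```python
def node_utilization(requested_cpu_m, requested_mem_mi, alloc_cpu_m, alloc_mem_mi):
    """Utilization as the larger of the CPU and memory request ratios."""
    return max(requested_cpu_m / alloc_cpu_m, requested_mem_mi / alloc_mem_mi)


def may_scale_down(utilization, unneeded_minutes, minutes_since_last_scale_up,
                   threshold=0.5, unneeded_time=10, delay_after_add=10):
    """Defaults mirror the flag values used in cluster-autoscaler-deployment.yaml."""
    return (
        utilization < threshold
        and unneeded_minutes >= unneeded_time
        and minutes_since_last_scale_up >= delay_after_add
    )


if __name__ == "__main__":
    util = node_utilization(400, 1024, 1900, 7000)
    # Lightly used node, but a scale-up happened 2 minutes ago: keep it.
    print(may_scale_down(util, unneeded_minutes=12, minutes_since_last_scale_up=2))
    # Same node 15 minutes after the last scale-up: eligible for removal.
    print(may_scale_down(util, unneeded_minutes=12, minutes_since_last_scale_up=15))
```

The delay after a scale-up is what prevents the add/remove thrashing described above: a fresh node is not considered for removal until the cluster has had time to settle.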

COMMON MISTAKE / WARNING
Running CA without PodDisruptionBudgets on stateful workloads. When CA removes a node, it evicts every pod on that node. If all replicas of a 3-replica StatefulSet sit on the same node and there is no PDB, CA can evict all 3 pods at once, causing a full service outage during what should be a routine scale-down. Always pair CA with PDBs on any workload where availability matters.

Quick Reference & Troubleshooting Commands

Command	Purpose
`kubectl logs deployment/cluster-autoscaler -n kube-system`	View CA activity and decisions
`kubectl get nodes -w`	Watch nodes being added and removed
`kubectl describe pod <pending-pod>`	See why a pod is Pending (insufficient resources)
`kubectl get events -n kube-system | grep cluster-autoscaler`	CA events
`kubectl annotate pod <name> cluster-autoscaler.kubernetes.io/safe-to-evict=false`	Protect a pod from CA eviction
