Grafana — Turning Raw Metrics Into Actionable Dashboards
What is Grafana in Simple Terms?
Prometheus collects thousands of metrics every 15 seconds but stores them as raw numbers in a time-series database. Reading raw numbers is useless during an incident at 3am. Grafana connects to Prometheus and turns those numbers into visual dashboards — line charts, gauges, heatmaps, and alert panels — so engineers can understand cluster health at a glance.
At Zerodha, during market open at 9:15am when trading volume spikes 50x, the SRE team watches a Grafana dashboard showing pod CPU, request rate, error percentage, and HPA replica count all on one screen. Without Grafana, every metric requires a separate PromQL query typed manually.
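To make "typed manually" concrete, here is a minimal sketch of the kind of range query a dashboard panel issues on your behalf. It only builds the request URL for the Prometheus HTTP API endpoint `/api/v1/query_range`. The base URL, the PromQL expression, and the timestamps are illustrative assumptions, not values taken from any real cluster.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step_seconds):
    """Build a Prometheus /api/v1/query_range URL for one panel's query."""
    params = {
        "query": promql,             # the PromQL expression behind the panel
        "start": start,              # Unix timestamp in seconds
        "end": end,                  # Unix timestamp in seconds
        "step": f"{step_seconds}s",  # spacing between returned data points
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical values for illustration only.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = 'sum(rate(http_requests_total{namespace="production"}[5m]))'

url = build_range_query_url(PROMETHEUS_URL, QUERY, 1700000000, 1700003600, 30)
print(url)
```

A dashboard with four panels repeats this for four different expressions on every refresh, which is the work an engineer would otherwise do by hand.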
How Grafana Connects to Prometheus
```
+------------------------------------------+
| Prometheus                               |
| Scrapes metrics every 15 seconds         |
| Stores in time-series database           |
+------------------------------------------+
                    |
                    |  PromQL queries
                    |  (Grafana asks: "give me CPU usage")
                    v
+------------------------------------------+
| Grafana                                  |
| Runs PromQL queries against Prometheus   |
| Renders results as charts and panels     |
| Refreshes automatically (every 30s)      |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Engineer opens browser                   |
| Sees live dashboard of cluster health    |
| Can drill down into any service or pod   |
+------------------------------------------+
```

Installing Grafana on Kubernetes
The easiest way is through the kube-prometheus-stack Helm chart, which installs Prometheus, Grafana, and Alertmanager together:
```bash
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

# --create-namespace creates the monitoring namespace if it does not exist yet
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=Mumbai@2024 \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi

# Verify Grafana is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
# NAME                                       READY   STATUS
# kube-prometheus-stack-grafana-6d8f9b-xxx   3/3     Running

# Access Grafana in your browser
kubectl port-forward svc/kube-prometheus-stack-grafana \
  3000:80 -n monitoring
# Open http://localhost:3000
# Username: admin | Password: Mumbai@2024
```

Pre-Built Dashboard IDs to Import Immediately
Grafana has a public dashboard library. Import these by ID in the Grafana UI under Dashboards -> Import:
| Dashboard ID | What It Shows | Use It For |
|---|---|---|
| 315 | Kubernetes Cluster Overview | Node health, pod counts, resource pressure |
| 6417 | Kubernetes Pods | CPU, memory, restart count per pod |
| 1860 | Node Exporter Full | Node-level disk, network I/O, CPU steal |
| 13332 | Resource Requests vs Limits | Which pods are over-provisioned or under-provisioned |
| 7249 | Kubernetes Cluster Autoscaler | Node scaling events and decisions |
| 3119 | Kubernetes Deployment | Rolling update status and replica health |
```
# Import a dashboard by ID:
# Grafana UI -> Left sidebar -> Dashboards -> Import
# Enter ID -> Load -> Select Prometheus as data source -> Import
```

Building a Custom Dashboard for Your Service
Every team at Swiggy or Razorpay has their own namespace-scoped dashboard. Here is how to build one:
```
# Step 1: Create a new dashboard
# Grafana UI -> Dashboards -> New Dashboard -> Add Visualisation

# Step 2: Connect to Prometheus data source
# Select Prometheus from the data source dropdown

# Step 3: Write PromQL queries for your panels
```

The five most useful panels for any service dashboard:
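```
# Panel 1: Request rate (requests per second, summed across all series)
sum(rate(http_requests_total{namespace="production", service="payment-api"}[5m]))

# Panel 2: Error rate percentage (5xx responses)
# Both sides are aggregated with sum(); without it, the division only pairs
# series whose labels match exactly, so 5xx series divide by themselves.
sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m]))
  /
sum(rate(http_requests_total{namespace="production"}[5m]))
  * 100

# Panel 3: CPU usage per pod (percent of one core)
# container!="" skips the pod-level cgroup series so usage is not double counted
sum by (pod) (
  rate(container_cpu_usage_seconds_total{
    namespace="production", pod=~"payment-api-.*", container!=""}[5m])
) * 100

# Panel 4: Memory usage per pod (in MiB)
sum by (pod) (
  container_memory_usage_bytes{
    namespace="production", pod=~"payment-api-.*", container!=""}
) / 1024 / 1024

# Panel 5: Pod restart count (restarts in last hour)
increase(kube_pod_container_status_restarts_total{
  namespace="production"}[1h]) > 0
```

Configuring Alerts in Grafana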
Grafana can send alerts when a metric crosses a threshold — before users start complaining:
```
# In Grafana UI:
# Open any panel -> Edit -> Alert tab -> Create Alert Rule

# Example: Alert when error rate exceeds 5%
# Condition: WHEN last() OF query(A, 5m, now) IS ABOVE 5
# Evaluate every: 1m
# For: 2m (alert only fires if condition holds for 2 minutes)
# Notification: Send to Slack #prod-alerts channel
```

```yaml
# Or define alerts as code using PrometheusRule CRD
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-api-alerts
  namespace: production
  # Depending on chart values, the rule may also need a label such as
  # release: kube-prometheus-stack before Prometheus picks it up.
spec:
  groups:
    - name: payment-api
      rules:
        - alert: HighErrorRate
          # sum() on both sides so the ratio is computed across all series
          expr: |
            sum(rate(http_requests_total{status=~"5..", service="payment-api"}[5m]))
            /
            sum(rate(http_requests_total{service="payment-api"}[5m]))
            > 0.05
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment API error rate above 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"
```

Grafana Key Concepts
| Concept | What It Is |
|---|---|
| Dashboard | A collection of panels showing related metrics |
| Panel | A single chart, gauge, table, or stat widget |
| Data Source | Where Grafana fetches data from (Prometheus, Loki, etc.) |
| Variable | A dropdown filter that changes all panels at once (e.g. namespace selector) |
| Annotation | A vertical line on a chart marking a deployment or incident |
| Folder | Organises dashboards by team or service |
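The concepts in the table map onto Grafana's dashboard JSON model. The sketch below builds a heavily trimmed version of that model as a Python dictionary, so the relationship between dashboard, panel, data source, variable, and annotation is visible in one place. The data source UID, the dashboard title, and the helper names `make_panel` and `make_dashboard` are hypothetical, and a real exported dashboard carries many more fields than shown here.

```python
import json

# Hypothetical data source UID; a real one is assigned by your Grafana instance.
PROMETHEUS_UID = "prometheus"


def make_panel(panel_id, title, expr):
    """One Panel: a single visualisation backed by one PromQL query."""
    return {
        "id": panel_id,
        "title": title,
        "type": "timeseries",
        "datasource": {"type": "prometheus", "uid": PROMETHEUS_UID},  # Data Source
        "targets": [{"refId": "A", "expr": expr}],
    }


def make_dashboard(title, panels):
    """One Dashboard: panels plus the variables and annotations they share."""
    return {
        "title": title,
        "refresh": "30s",
        "panels": panels,
        "templating": {  # Variables
            "list": [
                {
                    "name": "namespace",
                    "type": "query",
                    "query": "label_values(kube_pod_info, namespace)",
                }
            ]
        },
        "annotations": {"list": []},  # Annotations (none defined in this sketch)
    }


dashboard = make_dashboard(
    "Payment API",
    [
        make_panel(
            1,
            "Request rate",
            'sum(rate(http_requests_total{namespace="$namespace"}[5m]))',
        )
    ],
)
print(json.dumps(dashboard, indent=2))
```

The folder does not appear in this structure because it is chosen when the dashboard is saved or imported, not stored inside the dashboard definition itself.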
Using Dashboard Variables for Multi-Service Views
Variables let one dashboard serve all namespaces:
```
# In Dashboard Settings -> Variables -> Add Variable:
# Name: namespace
# Type: Query
# Query: label_values(kube_pod_info, namespace)
# This creates a dropdown of all namespaces

# Now use $namespace in all PromQL queries:
rate(http_requests_total{namespace="$namespace"}[5m])

# Engineers select their namespace from the dropdown
# All panels filter to that namespace automatically
```

Troubleshooting Common Grafana Issues
| Problem | Likely Cause | Fix |
|---|---|---|
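| No data on panels | Prometheus data source not configured | Settings -> Data Sources -> Add Prometheus -> URL: http://kube-prometheus-stack-prometheus.monitoring:9090 (the service name for the Helm release above; adjust it if your release name differs) |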
| Dashboard shows N/A | PromQL query returns no results | Check the metric name with `kubectl exec -it <prometheus-pod> -n monitoring -- promtool query instant http://localhost:9090 '<metric_name>'` |
| Panels not refreshing | Auto-refresh not set | Dashboard top right -> set refresh interval to 30s |
| Alert not firing | Alertmanager not connected | Check Grafana -> Alerting -> Contact Points are configured |
| Dashboard loads slowly | Too many panels or too long a time range | Reduce panel count, shorten the time range, or increase the minimum step interval |
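For the slow-dashboard row, the cost of a panel grows with the number of points each query returns, which is roughly the time range divided by the step. The sketch below works through that arithmetic. It assumes a fixed step (Grafana normally derives the step from the panel width) and uses the 11,000-points-per-series ceiling that Prometheus enforces on range queries. The function names are illustrative.

```python
MAX_POINTS = 11_000  # Prometheus rejects range queries finer than this per series


def points_per_series(range_seconds, step_seconds):
    """Approximate number of samples returned per series for a range query."""
    # One evaluation at the start, then one per step up to the end.
    return range_seconds // step_seconds + 1


def exceeds_limit(range_seconds, step_seconds):
    """True if the query is finer-grained than Prometheus allows."""
    return range_seconds / step_seconds > MAX_POINTS


ONE_HOUR = 3_600
SEVEN_DAYS = 7 * 86_400

print(points_per_series(ONE_HOUR, 15))     # last hour at a 15s step
print(points_per_series(SEVEN_DAYS, 15))   # last week at a 15s step
print(points_per_series(SEVEN_DAYS, 300))  # last week at a 5m step
print(exceeds_limit(SEVEN_DAYS, 15))
```

Widening the step from 15 seconds to 5 minutes cuts a seven-day query to about a twentieth of the points per series, which is why raising the minimum step interval helps long-range dashboards.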
```bash
# Check Grafana logs for data source connection errors
kubectl logs -n monitoring \
  -l app.kubernetes.io/name=grafana \
  -c grafana --tail=50

# Restart Grafana if panels stop loading
kubectl rollout restart deployment/kube-prometheus-stack-grafana \
  -n monitoring
```

**Tip:** Create a dedicated Grafana folder per team: Payments, Orders, Delivery, Platform. Each squad owns their dashboards. Link the primary dashboard URL directly in the team runbook so engineers hit the right dashboard in under 10 seconds during an incident. At Swiggy scale, finding the right dashboard under pressure costs precious MTTR minutes.
**Remember:** Grafana dashboards do not store data; they only visualise what is in Prometheus. If Prometheus has 15 days of retention and you need to investigate an incident from 20 days ago, the data is gone. Set Prometheus retention to at least 30 days and consider long-term storage with Thanos or Grafana Mimir for compliance and capacity planning.
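Longer retention costs disk, so it helps to estimate the size before changing it. The sketch below applies the rule of thumb from the Prometheus storage documentation: disk is roughly retention time in seconds times ingested samples per second times bytes per sample, with 1 to 2 bytes per sample as a typical average. The ingestion rate of 100,000 samples per second is a made-up figure for illustration; read your own from Prometheus before relying on the result.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough Prometheus disk need: retention * ingestion rate * sample size."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Hypothetical ingestion rate, for illustration only.
INGESTION_RATE = 100_000

for days in (15, 30):
    gigabytes = estimate_disk_bytes(days, INGESTION_RATE) / 1e9
    print(f"{days} days of retention: about {gigabytes:.1f} GB")
```

Doubling retention doubles the estimate, so the persistent volume has to grow along with the retention setting.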
**Common Mistake:** Using the same Grafana dashboard for all environments (dev, staging, production) without namespace variables. Engineers accidentally look at staging metrics while debugging a production incident. Always use dashboard variables so the namespace is explicit, and set the default variable value to `production` so that the environment engineers need during an incident is the one they see first.
**Security:** Grafana's admin account should never be used by individual engineers. Create organisation-level users with viewer or editor roles. On Razorpay-scale platforms, connect Grafana to your SSO provider (Okta, Google Workspace) so access is controlled by the same identity system as everything else, and revoked automatically when engineers leave the company.