What is Grafana? | DevOps Dictionary

Grafana — Turning Raw Metrics Into Actionable Dashboards

What is Grafana in Simple Terms?

Prometheus scrapes thousands of metrics at a regular interval (every 15 seconds in the setup described here) and stores them as raw numbers in a time-series database. Raw numbers are of little use during an incident at 3am. Grafana connects to Prometheus and turns those numbers into visual dashboards (line charts, gauges, heatmaps, and alert panels) so engineers can understand cluster health at a glance.

At Zerodha, during market open at 9:15am when trading volume spikes 50x, the SRE team watches a Grafana dashboard showing pod CPU, request rate, error percentage, and HPA replica count all on one screen. Without Grafana, every metric requires a separate PromQL query typed manually.

How Grafana Connects to Prometheus

◈ DIAGRAM

+------------------------------------------+
| Prometheus                               |
| Scrapes metrics every 15 seconds         |
| Stores in time-series database           |
+------------------------------------------+
                    |
                    | PromQL queries
                    | (Grafana asks: "give me CPU usage")
                    v
+------------------------------------------+
| Grafana                                  |
| Runs PromQL queries against Prometheus   |
| Renders results as charts and panels     |
| Refreshes automatically (every 30s)      |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Engineer opens browser                   |
| Sees live dashboard of cluster health    |
| Can drill down into any service or pod   |
+------------------------------------------+
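
The query step in the diagram can be sketched in Python. This is a minimal illustration, assuming Prometheus is reachable at http://localhost:9090 (for example through a port-forward). The helper name build_range_query_url is hypothetical; the /api/v1/query_range endpoint and its query, start, end, and step parameters belong to the Prometheus HTTP API. Grafana sends a comparable request for each panel on every refresh, although its exact request shape may differ.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step_seconds):
    """Build a Prometheus range-query URL for one panel query.

    start and end are Unix timestamps; step_seconds is the spacing
    between the returned data points.
    """
    params = {
        "query": promql,
        "start": start,
        "end": end,
        "step": step_seconds,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# One hour of the built-in "up" metric at 30-second resolution.
url = build_range_query_url("http://localhost:9090", "up", 1700000000, 1700003600, 30)
print(url)
# http://localhost:9090/api/v1/query_range?query=up&start=1700000000&end=1700003600&step=30
```

Opening that URL against a running Prometheus returns JSON with one list of [timestamp, value] pairs per series, which is the raw material for a line chart.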

Installing Grafana on Kubernetes

The easiest way is through the kube-prometheus-stack Helm chart, which installs Prometheus, Grafana, and Alertmanager together:

Bash

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
 
# --create-namespace creates the monitoring namespace if it does not exist yet
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=Mumbai@2024 \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi
 
# Verify Grafana is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
# NAME                                      READY   STATUS
# kube-prometheus-stack-grafana-6d8f9b-xxx  3/3     Running
 
# Access Grafana in your browser
kubectl port-forward svc/kube-prometheus-stack-grafana \
  3000:80 -n monitoring
# Open http://localhost:3000
# Username: admin | Password: Mumbai@2024

Pre-Built Dashboard IDs to Import Immediately

Grafana has a public dashboard library. Import these by ID in Grafana UI under Dashboards -> Import:

Dashboard ID	What It Shows	Use It For
315	Kubernetes Cluster Overview	Node health, pod counts, resource pressure
6417	Kubernetes Pods	CPU, memory, restart count per pod
1860	Node Exporter Full	Node-level disk, network I/O, CPU steal
13332	Resource Requests vs Limits	Which pods are over-provisioned or under-provisioned
7249	Kubernetes Cluster Autoscaler	Node scaling events and decisions
3119	Kubernetes Deployment	Rolling update status and replica health

Bash

# Import a dashboard by ID:
# Grafana UI -> Left sidebar -> Dashboards -> Import
# Enter ID -> Load -> Select Prometheus as data source -> Import

Building a Custom Dashboard for Your Service

Teams at companies like Swiggy or Razorpay typically maintain their own namespace-scoped dashboards. Here is how to build one:

Bash

# Step 1 — Create a new dashboard
# Grafana UI -> Dashboards -> New Dashboard -> Add Visualisation
 
# Step 2 — Connect to Prometheus data source
# Select Prometheus from the data source dropdown
 
# Step 3 — Write PromQL queries for your panels

The five most useful panels for any service dashboard:

Bash

# Panel 1 — Request rate (requests per second)
rate(http_requests_total{namespace="production", service="payment-api"}[5m])
 
# Panel 2 — Error rate percentage (5xx responses)
# sum() aggregates away the status label so both sides of the division match
sum(rate(http_requests_total{namespace="production", status=~"5.."}[5m])) /
sum(rate(http_requests_total{namespace="production"}[5m])) * 100
 
# Panel 3 — CPU usage per pod
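# Result is a percentage of one CPU core, summed over each pod's containers
# container!="" drops the pod-level aggregate series so usage is not double counted
sum by (pod) (rate(container_cpu_usage_seconds_total{
  namespace="production",
  pod=~"payment-api-.*",
  container!=""
}[5m])) * 100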
 
# Panel 4 — Memory usage per pod (in MB)
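# container!="" skips the pod-level aggregate series; dividing by 1024 twice gives MiB
sum by (pod) (container_memory_usage_bytes{
  namespace="production",
  pod=~"payment-api-.*",
  container!=""
}) / 1024 / 1024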
 
# Panel 5 — Pod restart count (restarts in last hour)
increase(kube_pod_container_status_restarts_total{
  namespace="production"
}[1h]) > 0
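
To make the arithmetic behind Panels 1 and 2 concrete, here is a small worked example in Python. It is a simplified sketch: the counter values are hypothetical, and PromQL's rate() additionally handles counter resets and extrapolates to the window boundaries, which this version ignores.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second increase of a counter across a window."""
    return (last_value - first_value) / window_seconds


def error_percentage(error_rate, total_rate):
    """Share of requests that failed, as a percentage."""
    if total_rate == 0:
        return 0.0  # no traffic in the window, so nothing failed
    return 100 * error_rate / total_rate


# Hypothetical counter samples taken 5 minutes (300 s) apart.
total_rps = per_second_rate(120_000, 150_000, 300)  # all requests
error_rps = per_second_rate(600, 900, 300)          # 5xx responses only

print(total_rps)                               # 100.0
print(error_percentage(error_rps, total_rps))  # 1.0
```

With these numbers the service handles 100 requests per second and 1% of them fail, which is what the two panels would display.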

Configuring Alerts in Grafana

Grafana can send alerts when a metric crosses a threshold — before users start complaining:

Bash

# In Grafana UI:
# Open any panel -> Edit -> Alert tab -> Create Alert Rule
 
# Example: Alert when error rate exceeds 5%
# Condition: WHEN last() OF query(A, 5m, now) IS ABOVE 5
# Evaluate every: 1m
# For: 2m (alert only fires if condition holds for 2 minutes)
# Notification: Send to Slack #prod-alerts channel
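
The interaction between "Evaluate every" and "For" can be sketched as a small state machine. This is an illustrative model only, not Grafana's implementation; the function name alert_states and the sample values are hypothetical. It assumes the rule is evaluated once per interval and fires only after the condition has held continuously for the "For" duration.

```python
def alert_states(values, threshold, interval_s=60, for_s=120):
    """Return 'ok', 'pending', or 'firing' for each evaluation.

    values holds one measurement per evaluation, interval_s apart.
    """
    states = []
    breaches = 0  # consecutive evaluations above the threshold
    for value in values:
        if value > threshold:
            breaches += 1
            held_s = (breaches - 1) * interval_s  # time since the first breach
            states.append("firing" if held_s >= for_s else "pending")
        else:
            breaches = 0  # any recovery resets the timer
            states.append("ok")
    return states


# Error rate (%) sampled once per minute, with a 5% threshold.
error_rates = [2, 6, 7, 8, 3, 9]
print(alert_states(error_rates, threshold=5))
# ['ok', 'pending', 'pending', 'firing', 'ok', 'pending']
```

The short spike at the end never fires because it has not yet lasted two minutes, which is the point of the "For" setting: it filters out brief blips.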

YAML

# Or define alerts as code using PrometheusRule CRD
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
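metadata:
  name: payment-api-alerts
  namespace: production
  labels:
    # Assumption: with the chart's default rule selector, Prometheus only
    # loads PrometheusRule objects carrying the Helm release label
    release: kube-prometheus-stack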
spec:
  groups:
    - name: payment-api
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..", service="payment-api"}[5m])) /
            sum(rate(http_requests_total{service="payment-api"}[5m])) > 0.05
          for: 2m
          labels:
            severity: critical
            team: payments
          annotations:
            summary: "Payment API error rate above 5%"
            description: "Error rate is {{ $value | humanizePercentage }}"

Grafana Key Concepts

Concept	What It Is
Dashboard	A collection of panels showing related metrics
Panel	A single chart, gauge, table, or stat widget
Data Source	Where Grafana fetches data from (Prometheus, Loki, etc.)
Variable	A dropdown filter that changes all panels at once (e.g. namespace selector)
Annotation	A vertical line on a chart marking a deployment or incident
Folder	Organises dashboards by team or service
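
These concepts map onto the JSON document that Grafana stores for each dashboard. The sketch below builds a heavily reduced version of that document in Python. The helper names make_panel and make_dashboard are hypothetical, and the field names are assumed from Grafana's dashboard JSON model; a dashboard exported from the UI contains many more fields, including an explicit data source per panel.

```python
import json


def make_panel(panel_id, title, expr, x=0, y=0):
    """One panel: a single visualisation backed by one PromQL query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 12, "x": x, "y": y},  # size and position on the grid
        "targets": [{"refId": "A", "expr": expr}],
    }


def make_dashboard(title, panels):
    """A dashboard: panels plus one 'namespace' variable shared by all of them."""
    return {
        "title": title,
        "refresh": "30s",
        "time": {"from": "now-1h", "to": "now"},
        "templating": {"list": [{
            "name": "namespace",
            "type": "query",
            "query": "label_values(kube_pod_info, namespace)",
        }]},
        "panels": panels,
    }


dashboard = make_dashboard("payment-api", [
    make_panel(1, "Request rate",
               'sum(rate(http_requests_total{namespace="$namespace"}[5m]))'),
    make_panel(2, "Pod restarts",
               'increase(kube_pod_container_status_restarts_total{namespace="$namespace"}[1h])',
               x=12),
])
print(json.dumps(dashboard, indent=2))
```

Keeping dashboards as generated JSON makes them reviewable in version control, though whether that is worth the effort depends on how many dashboards a team maintains.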

Using Dashboard Variables for Multi-Service Views

Variables let one dashboard serve all namespaces:

Bash

# In Dashboard Settings -> Variables -> Add Variable:
# Name: namespace
# Type: Query
# Query: label_values(kube_pod_info, namespace)
# This creates a dropdown of all namespaces
 
# Now use $namespace in all PromQL queries:
rate(http_requests_total{namespace="$namespace"}[5m])
# Engineers select their namespace from the dropdown
# All panels filter to that namespace automatically
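
Conceptually, the variable is substituted into each query before it is sent to Prometheus. The Python sketch below imitates that step with the standard library's string.Template, which happens to use the same $name syntax. The function name render_query is hypothetical, and Grafana's own interpolation supports more than this (multi-value variables and formatting options, for example).

```python
from string import Template


def render_query(query_template, **variables):
    """Replace $name placeholders; unknown placeholders are left untouched."""
    return Template(query_template).safe_substitute(**variables)


panel_query = 'rate(http_requests_total{namespace="$namespace"}[5m])'

print(render_query(panel_query, namespace="production"))
# rate(http_requests_total{namespace="production"}[5m])
print(render_query(panel_query, namespace="staging"))
# rate(http_requests_total{namespace="staging"}[5m])
```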

Troubleshooting Common Grafana Issues

Problem	Likely Cause	Fix
No data on panels	Prometheus data source not configured	Settings -> Data Sources -> Add Prometheus -> URL of the Prometheus service, e.g. `http://kube-prometheus-stack-prometheus.monitoring:9090` for the install above
Dashboard shows N/A	PromQL query returns no results	Check the metric name with `kubectl exec -n monitoring <prometheus-pod> -c prometheus -- promtool query instant http://localhost:9090 '<metric_name>'`
Panels not refreshing	Auto-refresh not set	Dashboard top right -> set refresh interval to 30s
Alert not firing	Alertmanager not connected	Check Grafana -> Alerting -> Contact Points are configured
Dashboard loads slowly	Too many panels, or a long time range queried at a fine step	Reduce panel count, shorten the time range, or increase the minimum step interval
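
The first two rows of the table come down to telling a failed query apart from a query that succeeded but matched nothing. Here is a minimal Python sketch of that check, assuming the response body of Prometheus's /api/v1/query endpoint has already been decoded from JSON. The function name diagnose and the sample responses are hypothetical.

```python
def diagnose(response):
    """Classify a decoded /api/v1/query response body."""
    if response.get("status") != "success":
        # Prometheus reports failures with status "error" and an error message.
        return f"query failed: {response.get('error', 'unknown error')}"
    if not response.get("data", {}).get("result"):
        # Valid PromQL, but no series carry that metric name or label set.
        return "query succeeded but matched no series"
    return "ok"


empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
broken = {"status": "error", "errorType": "bad_data", "error": "parse error"}

print(diagnose(empty))   # query succeeded but matched no series
print(diagnose(broken))  # query failed: parse error
```

An empty result usually points to a wrong metric name or label value, while a failed query points to a PromQL syntax problem or an unreachable data source.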

Bash

# Check Grafana logs for data source connection errors
kubectl logs -n monitoring \
  -l app.kubernetes.io/name=grafana \
  -c grafana --tail=50
 
# Restart Grafana if panels stop loading
kubectl rollout restart deployment/kube-prometheus-stack-grafana \
  -n monitoring

**Tip:** Create a dedicated Grafana folder per team — Payments, Orders, Delivery, Platform. Each squad owns their dashboards. Link the primary dashboard URL directly in the team runbook so engineers hit the right dashboard in under 10 seconds during an incident. At Swiggy scale, finding the right dashboard under pressure costs precious MTTR minutes.

**Remember:** Grafana dashboards do not store data — they only visualise what is in Prometheus. If Prometheus has 15 days of retention and you need to investigate an incident from 20 days ago, the data is gone. Set Prometheus retention to at least 30 days and consider long-term storage with Thanos or Grafana Mimir for compliance and capacity planning.
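
Longer retention needs more disk, so it helps to estimate the cost before changing it. The sketch below uses the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with 1 to 2 bytes per sample after compression. The series count of 100,000 is a hypothetical figure, and the function name estimate_disk_bytes is illustrative.

```python
def estimate_disk_bytes(retention_days, active_series, scrape_interval_s,
                        bytes_per_sample=2):
    """Rough Prometheus disk usage for a given retention period."""
    retention_s = retention_days * 24 * 60 * 60
    # Each active series produces one sample per scrape interval.
    return retention_s * active_series / scrape_interval_s * bytes_per_sample


for days in (15, 30):
    gb = estimate_disk_bytes(days, active_series=100_000, scrape_interval_s=15) / 1e9
    print(f"{days} days: ~{gb:.1f} GB")
# 15 days: ~17.3 GB
# 30 days: ~34.6 GB
```

Doubling retention doubles the estimate, so the persistent volume backing Prometheus has to be sized to match before retention is raised.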

**Common Mistake:** Using the same Grafana dashboard for all environments (dev, staging, production) without namespace variables. Engineers accidentally look at staging metrics while debugging a production incident. Always use dashboard variables so the namespace is explicit, and set the default variable value to `production` so that anyone opening the dashboard during an incident lands on production data first.

**Security:** Grafana's admin account should never be used by individual engineers. Create organisation-level users with viewer or editor roles. On Razorpay-scale platforms, connect Grafana to your SSO provider (Okta, Google Workspace) so access is controlled by the same identity system as everything else — and revoked automatically when engineers leave the company.