What is Prometheus? | DevOps Dictionary

Prometheus — The Metrics Engine of Every Kubernetes Cluster

What is Prometheus in Simple Terms?

Before Prometheus, many monitoring systems relied on applications pushing metrics to a central server. Prometheus flipped this: it pulls (scrapes) metrics from your applications on a schedule. At each scrape interval (15 seconds in the examples on this page), Prometheus visits every configured target, fetches its /metrics endpoint, and stores the samples in its time-series database.

At Zerodha, Prometheus scrapes 500+ pods every 15 seconds — collecting CPU usage, request latency, order processing rates, and database query times. This data powers the Grafana dashboards that SREs watch during market hours and the alerts that wake engineers up when error rates spike.

How Prometheus Works — The Pull Model

◈ DIAGRAM

+------------------------------------------+
| Prometheus (every 15 seconds)            |
+------------------------------------------+
          |           |           |
          | HTTP GET  | HTTP GET  | HTTP GET
          | /metrics  | /metrics  | /metrics
          v           v           v
+----------+  +----------+  +-------------------+
| Pod A    |  | Pod B    |  | Node Exporter     |
| payment  |  | order    |  | (on each node)    |
| :8080    |  | :8080    |  | :9100             |
| /metrics |  | /metrics |  | /metrics          |
+----------+  +----------+  +-------------------+
 
Each /metrics endpoint returns:
http_requests_total{method="GET", status="200"} 47832
http_requests_total{method="POST", status="500"} 12
process_memory_bytes 268435456
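
As a rough illustration of how this text format can be consumed, the sketch below sums every sample of one metric name from input like the lines above. `sum_metric` is a hypothetical helper, not part of Prometheus, and it assumes each line ends with the sample value (no optional timestamp).

```shell
#!/usr/bin/env bash
# sum_metric NAME: read Prometheus text-format lines on stdin and print
# the sum of all samples whose metric name equals NAME.
# Hypothetical helper for illustration; assumes no trailing timestamps.
sum_metric() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                       # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)          # drop the label set, keep the name
      if (metric == name) total += $NF  # last field is the sample value
    }
    END { printf "%d\n", total }
  '
}
```

Piping the three sample lines above into `sum_metric http_requests_total` adds 47832 and 12 and prints 47844. Against a live pod, the input could come from `curl -s localhost:8080/metrics`.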

Installing Prometheus on Kubernetes

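A common way to install Prometheus on Kubernetes is the kube-prometheus-stack Helm chart, which installs Prometheus together with Alertmanager, Grafana, kube-state-metrics, and Node Exporter. The example below assumes a gp3 storage class exists in the cluster (typical on AWS EKS); substitute your own: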

Bash

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
 
kubectl create namespace monitoring
 
helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=gp3 \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
 
# Verify all components are running
kubectl get pods -n monitoring
# prometheus-kube-prometheus-stack-prometheus-0    2/2  Running
# alertmanager-kube-prometheus-stack-alertmanager-0  2/2  Running
# kube-prometheus-stack-grafana-xxx               3/3  Running
# kube-prometheus-stack-kube-state-metrics-xxx    1/1  Running
# kube-prometheus-stack-node-exporter-xxx         2/2  Running (one per node)
 
# Access the Prometheus web UI
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
# Open http://localhost:9090

Key Components That Come With the Stack

◈ DIAGRAM

+------------------------------------------+
| kube-state-metrics                       |
| Exposes Kubernetes object state as       |
| metrics: pod counts, deployment health,  |
| node conditions, PVC status              |
| Target: /metrics on port 8080            |
+------------------------------------------+
 
+------------------------------------------+
| Node Exporter (DaemonSet)                |
| Exposes host-level metrics from every    |
| node: CPU, memory, disk I/O, network,    |
| filesystem usage                         |
| Target: /metrics on port 9100 per node   |
+------------------------------------------+
 
+------------------------------------------+
| cAdvisor (built into kubelet)            |
| Exposes container-level metrics:         |
| container CPU, memory, filesystem        |
| Target: kubelet /metrics/cadvisor        |
+------------------------------------------+

Configuring Prometheus to Scrape Your Application

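With the Prometheus Operator (installed by kube-prometheus-stack), ServiceMonitor custom resources tell Prometheus which Services to scrape. The operator generates the scrape configuration from them, so you do not edit the Prometheus config file directly: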

YAML

# Step 1 — Add /metrics endpoint to your application
# Example: Node.js with prom-client library
 
# In your Node.js app:
# const client = require('prom-client')
# const register = new client.Registry()
# client.collectDefaultMetrics({ register })
# app.get('/metrics', async (req, res) => {
#   res.set('Content-Type', register.contentType)
#   res.send(await register.metrics())
# })
 
# Step 2 — Create a ServiceMonitor to tell Prometheus about your service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payment-api
  namespace: production
  labels:
    release: kube-prometheus-stack  # Must match Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: payment-api              # Selects your Service
  endpoints:
    - port: http                    # Named port on the Service
      path: /metrics                # Where metrics are exposed
      interval: 15s                 # Scrape frequency
      scrapeTimeout: 10s            # Timeout per scrape
  namespaceSelector:
    matchNames:
      - production
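
Prometheus rejects a scrape configuration whose timeout is longer than its interval, so `scrapeTimeout` has to stay at or below `interval`. A minimal sketch of that check follows; the helper names are hypothetical, and only single-unit durations such as `15s`, `1m`, or `2h` are handled.

```shell
#!/usr/bin/env bash
# to_seconds DURATION: convert a single-unit duration (s, m, h) to seconds.
# Hypothetical helper; compound values such as 1m30s are not supported.
to_seconds() {
  local v="$1"
  case "$v" in
    *ms) return 1 ;;                       # sub-second values not handled
    *s)  echo "${v%s}" ;;
    *m)  echo $(( ${v%m} * 60 )) ;;
    *h)  echo $(( ${v%h} * 3600 )) ;;
    *)   return 1 ;;
  esac
}

# timeout_ok INTERVAL TIMEOUT: succeed when TIMEOUT <= INTERVAL.
# Returns 2 when either value cannot be parsed.
timeout_ok() {
  local interval timeout
  interval="$(to_seconds "$1")" || return 2
  timeout="$(to_seconds "$2")" || return 2
  [ "$timeout" -le "$interval" ]
}
```

With the values from the ServiceMonitor above, `timeout_ok 15s 10s` succeeds, while `timeout_ok 15s 30s` fails.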

Bash

# Verify Prometheus discovered your target
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
# Open http://localhost:9090/targets
# Your service should appear with state=UP
 
# If it shows DOWN, check:
# 1. ServiceMonitor label matches Prometheus serviceMonitorSelector
# 2. Service port name matches ServiceMonitor endpoint port
# 3. /metrics endpoint returns valid Prometheus format
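
Checks 1 and 2 come down to label and name matching. A `matchLabels` selector matches an object only when every key/value pair in the selector is present on that object. The sketch below illustrates that rule on comma-separated `key=value` strings; `labels_match` is a hypothetical helper, not a kubectl feature.

```shell
#!/usr/bin/env bash
# labels_match SELECTOR LABELS
# Both arguments are comma-separated key=value lists.
# Succeeds when every pair in SELECTOR also appears in LABELS,
# which mirrors matchLabels semantics. Hypothetical helper.
labels_match() {
  local selector="$1" labels="$2" pair
  local IFS=','
  for pair in $selector; do          # split the selector on commas
    case ",$labels," in
      *",$pair,"*) ;;                # pair present, keep checking
      *) return 1 ;;                 # pair missing, no match
    esac
  done
  return 0
}
```

For example, `labels_match "release=kube-prometheus-stack" "app=payment-api,release=kube-prometheus-stack"` succeeds, while a selector of `release=prometheus` fails against the same labels. The label string for a real object could come from `kubectl get servicemonitor payment-api -n production --show-labels`.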

Prometheus Storage and Retention

Bash

# Check current storage usage
kubectl exec -it prometheus-kube-prometheus-stack-prometheus-0 \
  -n monitoring -- \
  df -h /prometheus
 
# Check retention configuration
kubectl get prometheus kube-prometheus-stack-prometheus \
  -n monitoring -o yaml | grep retention
# retentionSize: 80GB  (delete oldest data when storage hits 80GB)
# retention: 30d        (delete data older than 30 days)
 
# Analyze the TSDB for cardinality and churn if queries become slow
# (read-only: this reports on the database, it does not compact it)
kubectl exec -it prometheus-kube-prometheus-stack-prometheus-0 \
  -n monitoring -- \
  promtool tsdb analyze /prometheus
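
For sizing the volume, the Prometheus documentation gives a rough formula: disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample, with about 1 to 2 bytes per sample after compression. The sketch below applies it. `estimate_tsdb_bytes` is a hypothetical helper, and the series count in the usage note is an assumed figure, not a measurement.

```shell
#!/usr/bin/env bash
# estimate_tsdb_bytes SERIES INTERVAL_SECONDS RETENTION_DAYS [BYTES_PER_SAMPLE]
# Prints an approximate on-disk size in bytes.
# BYTES_PER_SAMPLE defaults to 2, the upper end of the documented range.
estimate_tsdb_bytes() {
  local series="$1" interval="$2" days="$3" bps="${4:-2}"
  awk -v s="$series" -v i="$interval" -v d="$days" -v b="$bps" 'BEGIN {
    # retention seconds * (series / interval) samples per second * bytes
    printf "%.0f\n", d * 86400 * s * b / i
  }'
}
```

`estimate_tsdb_bytes 100000 15 30` prints 34560000000, roughly 35 GB, for an assumed 100,000 active series scraped every 15 seconds and kept for 30 days.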

Essential Prometheus Operational Commands

Bash

# Check which targets Prometheus is scraping (and their health)
# http://localhost:9090/targets after port-forward
 
# Check active alert rules
kubectl get prometheusrule -A
 
# View Prometheus configuration
kubectl get secret prometheus-kube-prometheus-stack-prometheus \
  -n monitoring -o jsonpath='{.data.prometheus\.yaml\.gz}' | \
  base64 -d | gunzip
 
# Force Prometheus to reload configuration
# (the operator's config-reloader sidecar normally does this automatically)
kubectl exec prometheus-kube-prometheus-stack-prometheus-0 \
  -n monitoring -c prometheus -- \
  kill -HUP 1
 
# Check Prometheus logs for scrape errors
kubectl logs prometheus-kube-prometheus-stack-prometheus-0 \
  -n monitoring -c prometheus --tail=50 | grep -i error

Troubleshooting Common Prometheus Issues

Problem	Likely Cause	Fix
Target shows DOWN	Pod not exposing /metrics correctly	`kubectl exec -it <pod> -- curl localhost:8080/metrics`
ServiceMonitor not discovered	Label selector mismatch	Check that the ServiceMonitor has the `release: kube-prometheus-stack` label
High memory usage on Prometheus	Too many high-cardinality metrics	Check for label explosion: `kubectl exec prometheus-kube-prometheus-stack-prometheus-0 -n monitoring -c prometheus -- promtool tsdb analyze /prometheus`
Slow queries in Grafana	Prometheus does not have enough memory	Increase `prometheus.prometheusSpec.resources.limits.memory` in Helm values
Missing metrics after pod restart	Scrape interval too long, or the pod was down between scrapes	Reduce `interval` (for example to `15s`); gaps while the pod is restarting are normal
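
When a pod restarts, its counters start again from zero. PromQL's `rate()` and `increase()` treat any drop in a counter as a reset and keep counting from the new value, so dashboards do not show a negative spike. The sketch below shows that idea in simplified form on raw sample values, one per line; `counter_increase` is a hypothetical helper and omits the extrapolation Prometheus applies.

```shell
#!/usr/bin/env bash
# counter_increase: read successive counter samples on stdin (one per line)
# and print the total increase, treating any decrease as a counter reset.
# Simplified illustration; Prometheus also extrapolates to window edges.
counter_increase() {
  awk '
    NR == 1 { prev = $1; next }        # first sample is the baseline
    {
      if ($1 < prev) total += $1       # reset: counter restarted from zero
      else           total += $1 - prev
      prev = $1
    }
    END { printf "%d\n", total }
  '
}
```

For the samples 100, 130, 160, 5, 25 (a restart between 160 and 5), `printf '%s\n' 100 130 160 5 25 | counter_increase` prints 85 rather than a negative number.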

**Remember:** Prometheus stores data on local disk; it is not replicated or distributed by default. Without a persistent volume, deleting the Prometheus pod destroys all historical metrics, and losing the volume itself has the same effect. Always configure `storageSpec` with a PersistentVolumeClaim so the data survives pod restarts, and use a StorageClass whose `reclaimPolicy` is `Retain` so the underlying volume survives accidental PVC deletion.

**Tip:** Avoid high-cardinality labels in your custom metrics. A label that has thousands of unique values (like a user ID or request ID) creates thousands of separate time series — each consuming storage and memory in Prometheus. Good labels have low cardinality: `status` (5-10 values), `method` (5-10 values), `service` (tens of values). Bad labels: `user_id`, `request_id`, `session_token`.
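
To see why one bad label matters, note that the worst-case number of time series for a metric is the product of the distinct value counts of its labels. The sketch below does that multiplication; `series_count` is a hypothetical helper, and the value counts in the usage note are assumed for illustration.

```shell
#!/usr/bin/env bash
# series_count N1 [N2 ...]: multiply the number of distinct values of each
# label to get the worst-case series count for one metric.
# Hypothetical helper; real counts are lower when not every combination occurs.
series_count() {
  local total=1 n
  for n in "$@"; do
    total=$(( total * n ))
  done
  printf '%d\n' "$total"
}
```

With 5 statuses, 5 methods, and 20 services, `series_count 5 5 20` prints 500. Adding a `user_id` label with 100,000 values, `series_count 5 5 20 100000` prints 50000000.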

**Common Mistake:** Using Prometheus alone for long-term storage, well beyond the 30 days configured above. Prometheus is designed for short-to-medium term storage. For compliance requirements, capacity planning, or year-over-year comparisons at Razorpay or Hotstar scale, add Thanos or Grafana Mimir as a long-term storage backend. These keep Prometheus data in object storage (such as S3), which allows much longer retention at low cost.

**Security:** The Prometheus web UI on port 9090 and the metrics endpoints on your pods should never be exposed to the public internet. Metrics data reveals internal service names, error patterns, and infrastructure topology that attackers can use for reconnaissance. Keep `kubectl port-forward` bound to localhost (its default) and use Kubernetes NetworkPolicies to restrict which pods can access the Prometheus service.