What is Alertmanager? | DevOps Dictionary

Alertmanager — Getting the Right Alert to the Right Person at the Right Time

What is Alertmanager in Simple Terms?

Prometheus detects problems — it watches metrics and fires an alert when a threshold is crossed. But Prometheus itself cannot send a Slack message or wake someone up on PagerDuty. That is Alertmanager's job.

Alertmanager sits between Prometheus and your team. It receives all firing alerts, groups related ones together (so a node failure that affects 50 pods sends ONE notification, not 50), routes different severity alerts to different destinations, and silences alerts during planned maintenance windows.

◈ DIAGRAM

+------------------------------------------+
| Prometheus                               |
| Evaluates alert rules every 1 minute     |
| Fires alert when condition is true       |
+------------------------------------------+
                    |
                    | HTTP POST (alert payload)
                    v
+------------------------------------------+
| Alertmanager                             |
|                                          |
| 1. Receives alert from Prometheus        |
| 2. Groups related alerts together        |
| 3. Waits group_wait (30s) for more       |
| 4. Routes based on labels                |
| 5. Sends to correct receiver             |
| 6. Repeats until alert resolves          |
+------------------------------------------+
          |                   |
          v                   v
+------------------+  +------------------+
| Slack            |  | PagerDuty        |
| #prod-warnings   |  | On-call engineer |
| (severity=warn)  |  | (severity=crit)  |
+------------------+  +------------------+
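
The "HTTP POST (alert payload)" arrow in the diagram can be tried by hand, which is useful for testing routing without waiting for a real alert. The sketch below builds a minimal payload for Alertmanager's v2 API (`/api/v2/alerts`). The function name `build_test_alert`, the `SEND_ALERT` switch, and the `ManualTest` alert name are illustrative assumptions, and the URL assumes the port-forward shown later on this page.

```shell
#!/bin/sh
# Build a minimal JSON body for POST /api/v2/alerts: a list of alerts,
# each with a label set and optional annotations.
# Arguments: alertname, namespace, severity (all hypothetical test values).
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","namespace":"%s","severity":"%s"},"annotations":{"summary":"Manual test alert"}}]\n' \
    "$1" "$2" "$3"
}

# Only send when explicitly asked, so the script is safe to source.
if [ "${SEND_ALERT:-0}" = "1" ]; then
  build_test_alert ManualTest production warning |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data @- "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
fi
```

Running it with `SEND_ALERT=1` should make the test alert appear in the web UI and flow through the same grouping and routing steps as an alert sent by Prometheus.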

Installing Alertmanager

Alertmanager is installed automatically by the kube-prometheus-stack Helm chart:

Bash

# Verify Alertmanager is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# NAME                                        READY   STATUS
# alertmanager-kube-prometheus-stack-alertmanager-0   2/2   Running
 
# Access the Alertmanager web UI
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring
# Open http://localhost:9093
# Shows all currently firing alerts and their state

Understanding the Core Configuration Concepts

◈ DIAGRAM

+------------------------------------------+
| Route                                    |
| Defines how alerts are matched and       |
| directed to receivers                    |
| Like an if-else decision tree for alerts |
+------------------------------------------+
 
+------------------------------------------+
| Receiver                                 |
| A named destination — Slack channel,     |
| PagerDuty service, email address, or     |
| any webhook endpoint                     |
+------------------------------------------+
 
+------------------------------------------+
| Inhibition Rule                          |
| Suppresses certain alerts when a more    |
| severe alert is already firing           |
| e.g. suppress pod alerts when whole      |
| node is already alerting                 |
+------------------------------------------+
 
+------------------------------------------+
| Silence                                  |
| Temporarily mutes matching alerts        |
| Used during planned maintenance windows  |
+------------------------------------------+
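
As a rough mental model of the inhibition box above, the sketch below mimics the "suppress pod alerts when the whole node is already alerting" example. It is a toy shell function, not how Alertmanager is implemented; the function name `is_inhibited` and the node names are made up for illustration.

```shell
#!/bin/sh
# is_inhibited ALERTNAME NODE "DOWN_NODES"
# DOWN_NODES is a space-separated list of nodes that currently have a
# firing NodeDown alert (hypothetical input for this sketch).
is_inhibited() {
  alertname="$1"
  node="$2"
  down_nodes="$3"

  # Only alerts whose name starts with "Pod" are inhibition targets.
  case "$alertname" in
    Pod*) ;;
    *) echo "no"; return ;;
  esac

  # The target is suppressed only if its node label equals the node
  # label of a firing source alert.
  for n in $down_nodes; do
    if [ "$n" = "$node" ]; then
      echo "yes"
      return
    fi
  done
  echo "no"
}

is_inhibited PodCrashLooping node-1 "node-1 node-7"   # yes
is_inhibited PodCrashLooping node-2 "node-1 node-7"   # no
is_inhibited HighErrorRate   node-1 "node-1 node-7"   # no
```

The last call shows why the shared label matters: an alert that is not a target, or that sits on a healthy node, still notifies as usual.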

A Complete Production Alertmanager Configuration

YAML

# alertmanager.yaml — applied via Kubernetes Secret
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      # Default Slack webhook for all receivers
      slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxx'
    # -- Routing Tree --------------------------------------------
    route:
      # Default receiver if no child route matches
      receiver: 'slack-warnings'
 
      # Group alerts by these labels
      # Alerts sharing the same alertname, namespace and severity go into one notification
      group_by: ['alertname', 'namespace', 'severity']
 
      # Wait 30s after first alert fires before sending
      # Allows grouping of related alerts that fire within 30s
      group_wait: 30s
 
      # After a group's first notification, wait 5 minutes before
      # notifying about new alerts that join the same group
      group_interval: 5m
 
      # Resend an unresolved alert every 12 hours
      repeat_interval: 12h
 
      # Child routes — checked in order, first match wins
      routes:
        # Critical alerts -> wake someone up via PagerDuty
        - match:
            severity: critical
          receiver: pagerduty-production
          group_wait: 10s       # Send critical alerts faster
          repeat_interval: 1h  # Repeat every hour until resolved

        # Payment team alerts -> their own Slack channel
        - match_re:
            namespace: 'payments.*'
          receiver: slack-payments-team

        # Warning alerts -> general ops channel
        - match:
            severity: warning
          receiver: slack-warnings
 
    # -- Receivers -----------------------------------------------
    receivers:
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#prod-warnings'
            send_resolved: true    # Send green recovery message
            title: |
              [{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
            text: |
              *Namespace:* {{ .GroupLabels.namespace }}
              *Severity:* {{ .GroupLabels.severity }}
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              {{ end }}
      - name: 'slack-payments-team'
        slack_configs:
          - channel: '#payments-alerts'
            send_resolved: true
            title: 'Payments Alert: {{ .GroupLabels.alertname }}'

      - name: 'pagerduty-production'
        pagerduty_configs:
          - service_key: '<your-pagerduty-integration-key>'
            description: '{{ .GroupLabels.alertname }} in {{ .GroupLabels.namespace }}'
 
    # -- Inhibition Rules -----------------------------------------
    inhibit_rules:
      # If a node is down, suppress individual pod alerts from that node
      # Prevents 50 pod alerts flooding Slack when the real issue is 1 node
      - source_match:
          alertname: 'NodeDown'
        target_match_re:
          alertname: 'Pod.*'
        equal: ['node']
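
To make "checked in order, first match wins" concrete, here is a small shell sketch that mirrors the three child routes in the configuration above. It is an illustration of the decision order only, not Alertmanager's matching code; `pick_receiver` is a made-up helper, and the shell glob `payments*` stands in for the anchored regex `payments.*`.

```shell
#!/bin/sh
# pick_receiver SEVERITY NAMESPACE
# Walks the child routes top to bottom and stops at the first match.
pick_receiver() {
  severity="$1"
  namespace="$2"

  # Route 1: severity = critical
  if [ "$severity" = "critical" ]; then
    echo "pagerduty-production"
    return
  fi

  # Route 2: namespace matches payments.*
  case "$namespace" in
    payments*) echo "slack-payments-team"; return ;;
  esac

  # Route 3: severity = warning
  if [ "$severity" = "warning" ]; then
    echo "slack-warnings"
    return
  fi

  # No child route matched: fall back to the root receiver.
  echo "slack-warnings"
}

pick_receiver critical payments-prod   # pagerduty-production
pick_receiver warning  payments-prod   # slack-payments-team
pick_receiver info     production      # slack-warnings
```

The first call is the one worth noticing: a critical alert from a payments namespace goes to PagerDuty and never reaches the payments Slack channel, because the critical route is listed first.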

Writing PrometheusRule Alert Rules

YAML

# prometheusrule.yaml — defines what Prometheus monitors and alerts on
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # Must match Prometheus ruleSelector
spec:
  groups:
    - name: pod-health
      interval: 1m   # Evaluate these rules every 1 minute
      rules:
        # Alert when a pod keeps restarting
        - alert: PodCrashLooping
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace="production"
            }[15m]) * 900 > 3
          for: 5m    # Only fire if restarting for 5 continuous minutes
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: |
              Pod {{ $labels.pod }} in namespace {{ $labels.namespace }}
              has restarted {{ $value | humanize }} times in 15 minutes.
              Check logs: kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous
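        # Alert when memory usage is above 85% of limit
        # Containers without a limit report 0; "> 0" drops them so the
        # division cannot produce +Inf and fire a false alert
        - alert: PodMemoryNearLimit
          expr: |
            container_memory_working_set_bytes{namespace="production"} /
            (container_spec_memory_limit_bytes{namespace="production"} > 0) > 0.85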
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} memory above 85% of limit"
            description: "Memory at {{ $value | humanizePercentage }}"
 
        # Alert when error rate exceeds 5%
        - alert: HighErrorRate
          expr: |
            sum by (service) (
              rate(http_requests_total{status=~"5..", namespace="production"}[5m])
            ) /
            sum by (service) (
              rate(http_requests_total{namespace="production"}[5m])
            ) > 0.05
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
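
The `* 900` in the PodCrashLooping expression converts a per-second rate into restarts over the 15-minute window (15 x 60 = 900 seconds). The sketch below redoes that arithmetic with `awk` for a made-up rate of 0.005 restarts per second; the helper names and sample values are assumptions for illustration only.

```shell
#!/bin/sh
# restarts_in_window RATE_PER_SECOND WINDOW_SECONDS
# Scales a per-second rate up to the number of events in the window.
restarts_in_window() {
  awk -v r="$1" -v w="$2" 'BEGIN { printf "%.1f\n", r * w }'
}

# would_fire VALUE THRESHOLD
# Prints "fires" when the value is strictly above the threshold.
would_fire() {
  awk -v v="$1" -v t="$2" 'BEGIN { s = (v > t) ? "fires" : "quiet"; print s }'
}

restarts_in_window 0.005 900   # 4.5
would_fire 4.5 3               # fires
would_fire 2.0 3               # quiet
```

Even when the expression is true, the alert only leaves the pending state once the condition has held for the full `for: 5m`.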

Managing Silences During Maintenance

Bash

# Create a silence via Alertmanager UI at http://localhost:9093
# Or use amtool CLI:
 
# Install amtool
go install github.com/prometheus/alertmanager/cmd/amtool@latest
 
# Silence all alerts in the payments-prod namespace for the next 2 hours
# (the silence starts as soon as the command runs)
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --comment="Planned maintenance window 02:00-04:00" \
  --duration=2h \
  namespace="payments-prod"
 
# List all active silences
amtool silence query --alertmanager.url=http://localhost:9093
 
# Expire a silence early
amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>
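
A silence can also be scheduled ahead of time instead of starting immediately. The sketch below assembles an `amtool silence add` command using its `--start` and `--end` flags (RFC 3339 timestamps). The wrapper `schedule_silence`, the `DRY_RUN` switch, and the dates are illustrative assumptions; by default it only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# schedule_silence START END MATCHER
# START and END are RFC 3339 timestamps; MATCHER is an amtool matcher.
schedule_silence() {
  start="$1"
  end="$2"
  matcher="$3"

  # Assemble the command as positional parameters so quoting survives.
  set -- amtool silence add \
    --alertmanager.url="${ALERTMANAGER_URL:-http://localhost:9093}" \
    --start="$start" --end="$end" \
    --comment="Planned maintenance" \
    "$matcher"

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"     # print only; nothing is sent to Alertmanager
  else
    "$@"          # actually create the silence
  fi
}

# Prints the command for a 02:00-04:00 UTC window on a made-up date.
schedule_silence 2030-01-01T02:00:00Z 2030-01-01T04:00:00Z 'namespace=~payments.*'
```

Setting `DRY_RUN=0` runs the command for real, which requires `amtool` on the PATH and a reachable Alertmanager.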

Troubleshooting Alertmanager

| Problem | Likely Cause | Fix |
|---|---|---|
| Alerts firing but no Slack message | Receiver misconfigured | Run `amtool config routes test` to trace which receiver is selected |
| Too many duplicate notifications | `group_by` too fine-grained (for example, it includes a per-pod label) | Group by broader labels, such as `group_by: ['alertname', 'namespace']` on the route |
| Alert fires and resolves repeatedly | `for` duration too short | Increase `for` on the alert rule (for example, `for: 5m`) to require a sustained condition |
| Silence not working | Labels don't match alert labels | Check the alert's exact labels with `amtool alert query` |
| PagerDuty not receiving | Wrong integration key | Verify that the service key in the receiver config matches the PagerDuty service |

Bash

# Check Alertmanager logs for routing and delivery errors
kubectl logs -n monitoring \
  alertmanager-kube-prometheus-stack-alertmanager-0 \
  -c alertmanager --tail=50
 
# View all currently firing alerts
kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring
# Open http://localhost:9093/#/alerts
 
# Test your routing configuration
amtool config routes test \
  --alertmanager.url=http://localhost:9093 \
  alertname=HighErrorRate namespace=production severity=critical
# Shows which receiver would handle this alert

COMMON MISTAKE / WARNING
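**Common Mistake:** Leaving grouping to chance. If `group_by` includes a per-pod label, or is set to `['...']` (which disables aggregation), a single node failure that takes down 30 pods sends 30 separate Slack messages at once and floods the channel. If `group_by` is left out entirely, unrelated alerts on the same route can be lumped into one hard-to-read notification instead. Set `group_by` explicitly to labels that describe the incident, such as `['alertname', 'namespace']`, and keep `group_wait` around `30s` so related alerts are batched into a single notification.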
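
The effect of the grouping labels can be seen with a toy count. The sketch below generates 30 made-up pod alerts from one incident and counts how many distinct groups, and therefore notifications, result from grouping on the first two labels versus all three. It is plain `awk` over sample data, not Alertmanager's grouping code, and the alert and pod names are hypothetical.

```shell
#!/bin/sh
# Emit 30 sample alerts as "alertname namespace pod" lines.
sample_alerts() {
  i=1
  while [ "$i" -le 30 ]; do
    echo "PodNotReady production web-$i"
    i=$((i + 1))
  done
}

# count_groups N
# Groups stdin lines by their first N fields and prints the group count.
count_groups() {
  awk -v n="$1" '
    { key = ""; for (i = 1; i <= n; i++) key = key FS $i; seen[key] = 1 }
    END { c = 0; for (k in seen) c++; print c }'
}

sample_alerts | count_groups 2   # alertname + namespace       -> 1
sample_alerts | count_groups 3   # alertname + namespace + pod -> 30
```

Adding a high-cardinality label such as the pod name to `group_by` is what turns one notification into thirty.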

PLACEMENT PRO TIP
**Tip:** Write alert annotations that include the exact kubectl command engineers need to start debugging. The `description` field should say `kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous` rather than just `pod is crashing`. At 3am during a Zerodha production incident, engineers should be able to copy the command from the Slack notification and run it immediately without thinking.

REMEMBER THIS
**Remember:** Alertmanager deduplicates alerts by their label set. If Prometheus fires the same alert with identical labels twice (which it does when a condition is persistently true), Alertmanager sends only ONE notification — not two. This is intentional. The `repeat_interval` controls when the reminder notification is sent for still-unresolved alerts.

COMMON MISTAKE / WARNING
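**Security:** Never commit Slack webhook URLs, PagerDuty keys, or any other credentials to Git in plain text, and never place them in a ConfigMap. Keep the Alertmanager configuration in a Kubernetes Secret (as kube-prometheus-stack does) and create that Secret from a secret store or an encrypted manifest rather than from a file checked into the repository. Recent Alertmanager versions also accept file-based fields, such as `api_url_file` for Slack, so the configuration can point at a mounted Secret instead of embedding the value. On a shared Razorpay cluster, a ConfigMap is visible to anyone with `kubectl get configmap` access, so credentials must be in Secrets, not ConfigMaps.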