Alertmanager — Getting the Right Alert to the Right Person at the Right Time
What is Alertmanager in Simple Terms?
Prometheus detects problems — it watches metrics and fires an alert when a threshold is crossed. But Prometheus itself cannot send a Slack message or wake someone up on PagerDuty. That is Alertmanager's job.
Alertmanager sits between Prometheus and your team. It receives all firing alerts, groups related ones together (so a node failure that affects 50 pods sends ONE notification, not 50), routes different severity alerts to different destinations, and silences alerts during planned maintenance windows.
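To make "receives all firing alerts" concrete, here is a minimal sketch of the kind of JSON body a client posts to Alertmanager's `/api/v2/alerts` endpoint. The alert name, label values, and the `ALERTMANAGER_URL` variable are illustrative assumptions, and the sender function assumes Alertmanager is reachable at `localhost:9093` (for example through a port-forward).

```shell
#!/bin/sh
# Build the JSON body for Alertmanager's POST /api/v2/alerts endpoint.
# Arguments: alertname, namespace, severity (hypothetical test values).
build_alert_payload() {
  alertname="$1"
  namespace="$2"
  severity="$3"
  printf '[{"labels":{"alertname":"%s","namespace":"%s","severity":"%s"},"annotations":{"summary":"Synthetic test alert"}}]\n' \
    "$alertname" "$namespace" "$severity"
}

# Send the payload to Alertmanager.
# ALERTMANAGER_URL is an assumed variable; it defaults to a local port-forward.
send_test_alert() {
  build_alert_payload "$@" | curl -sS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- \
    "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}

# Usage (not run here, it needs a reachable Alertmanager):
#   send_test_alert TestAlert production warning
```

Sending a synthetic alert like this can be a quick way to check routing and receivers without waiting for a real rule to fire.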
    +------------------------------------------+
    | Prometheus                               |
    | Evaluates alert rules every 1 minute     |
    | Fires alert when condition is true       |
    +------------------------------------------+
                         |
                         | HTTP POST (alert payload)
                         v
    +------------------------------------------+
    | Alertmanager                             |
    |                                          |
    | 1. Receives alert from Prometheus        |
    | 2. Groups related alerts together        |
    | 3. Waits group_wait (30s) for more       |
    | 4. Routes based on labels                |
    | 5. Sends to correct receiver             |
    | 6. Repeats until alert resolves          |
    +------------------------------------------+
              |                     |
              v                     v
    +------------------+   +------------------+
    | Slack            |   | PagerDuty        |
    | #prod-warnings   |   | On-call engineer |
    | (severity=warn)  |   | (severity=crit)  |
    +------------------+   +------------------+

Installing Alertmanager
Alertmanager is installed automatically by the kube-prometheus-stack Helm chart:
    # Verify Alertmanager is running
    kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
    # NAME                                                READY   STATUS
    # alertmanager-kube-prometheus-stack-alertmanager-0   2/2     Running

    # Access the Alertmanager web UI
    kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring
    # Open http://localhost:9093
    # Shows all currently firing alerts and their state

Understanding the Core Configuration Concepts
    +------------------------------------------+
    | Route                                    |
    | Defines how alerts are matched and       |
    | directed to receivers                    |
    | Like an if-else decision tree for alerts |
    +------------------------------------------+

    +------------------------------------------+
    | Receiver                                 |
    | A named destination: Slack channel,      |
    | PagerDuty service, email address, or     |
    | any webhook endpoint                     |
    +------------------------------------------+

    +------------------------------------------+
    | Inhibition Rule                          |
    | Suppresses certain alerts when a more    |
    | severe alert is already firing           |
    | e.g. suppress pod alerts when whole      |
    | node is already alerting                 |
    +------------------------------------------+

    +------------------------------------------+
    | Silence                                  |
    | Temporarily mutes matching alerts        |
    | Used during planned maintenance windows  |
    +------------------------------------------+

A Complete Production Alertmanager Configuration
    # alertmanager.yaml: applied via Kubernetes Secret
    apiVersion: v1
    kind: Secret
    metadata:
      name: alertmanager-kube-prometheus-stack-alertmanager
      namespace: monitoring
    type: Opaque
    stringData:
      alertmanager.yaml: |
        global:
          resolve_timeout: 5m
          # Default Slack webhook for all receivers
          slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxx'

        # -- Routing Tree --------------------------------------------
        route:
          # Default receiver if no child route matches
          receiver: 'slack-warnings'

          # Group alerts by these labels
          # All alerts with the same alertname + namespace + severity
          # go to one notification
          group_by: ['alertname', 'namespace', 'severity']

          # Wait 30s after first alert fires before sending
          # Allows grouping of related alerts that fire within 30s
          group_wait: 30s

          # Wait 5 minutes before notifying about NEW alerts that join
          # a group which has already been notified
          group_interval: 5m

          # Resend an unresolved alert every 12 hours
          repeat_interval: 12h

          # Child routes: checked in order, first match wins
          routes:
            # Critical alerts -> wake someone up via PagerDuty
            - match:
                severity: critical
              receiver: pagerduty-production
              group_wait: 10s       # Send critical alerts faster
              repeat_interval: 1h   # Repeat every hour until resolved

            # Payment team alerts -> their own Slack channel
            - match_re:
                namespace: 'payments.*'
              receiver: slack-payments-team

            # Warning alerts -> general ops channel
            - match:
                severity: warning
              receiver: slack-warnings

        # -- Receivers -----------------------------------------------
        receivers:
          - name: 'slack-warnings'
            slack_configs:
              - channel: '#prod-warnings'
                send_resolved: true   # Send green recovery message
                title: |
                  [{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
                text: |
                  *Namespace:* {{ .GroupLabels.namespace }}
                  *Severity:* {{ .GroupLabels.severity }}
                  {{ range .Alerts }}
                  *Alert:* {{ .Annotations.summary }}
                  *Description:* {{ .Annotations.description }}
                  {{ end }}

          - name: 'slack-payments-team'
            slack_configs:
              - channel: '#payments-alerts'
                send_resolved: true
                title: 'Payments Alert: {{ .GroupLabels.alertname }}'

          - name: 'pagerduty-production'
            pagerduty_configs:
              - service_key: '<your-pagerduty-integration-key>'
                description: '{{ .GroupLabels.alertname }} in {{ .GroupLabels.namespace }}'

        # -- Inhibition Rules -----------------------------------------
        inhibit_rules:
          # If a node is down, suppress individual pod alerts from that node
          # Prevents 50 pod alerts flooding Slack when the real issue is 1 node
          # Both alerts must carry the same 'node' label for this to apply
          - source_match:
              alertname: 'NodeDown'
            target_match_re:
              alertname: 'Pod.*'
            equal: ['node']

Writing PrometheusRule Alert Rules
    # prometheusrule.yaml: defines what Prometheus monitors and alerts on
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: production-alerts
      namespace: monitoring
      labels:
        release: kube-prometheus-stack   # Must match Prometheus ruleSelector
    spec:
      groups:
        - name: pod-health
          interval: 1m   # Evaluate these rules every 1 minute
          rules:
            # Alert when a pod keeps restarting
            # rate() is per second, so * 900 gives restarts per 15 minutes
            - alert: PodCrashLooping
              expr: |
                rate(kube_pod_container_status_restarts_total{
                  namespace="production"
                }[15m]) * 900 > 3
              for: 5m   # Only fire if the condition holds for 5 continuous minutes
              labels:
                severity: critical
                team: platform
              annotations:
                summary: "Pod {{ $labels.pod }} is crash looping"
                description: |
                  Pod {{ $labels.pod }} in namespace {{ $labels.namespace }}
                  has restarted {{ $value | humanize }} times in 15 minutes.
                  Check logs: kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous

            # Alert when memory usage is above 85% of limit
            # The "> 0" filter skips containers with no memory limit, which
            # report a limit of 0 and would otherwise divide by zero
            - alert: PodMemoryNearLimit
              expr: |
                container_memory_working_set_bytes{namespace="production"}
                /
                (container_spec_memory_limit_bytes{namespace="production"} > 0)
                > 0.85
              for: 10m
              labels:
                severity: warning
              annotations:
                summary: "Pod {{ $labels.pod }} memory above 85% of limit"
                description: "Memory at {{ $value | humanizePercentage }}"

            # Alert when error rate exceeds 5%
            - alert: HighErrorRate
              expr: |
                sum by (service) (
                  rate(http_requests_total{status=~"5..", namespace="production"}[5m])
                )
                /
                sum by (service) (
                  rate(http_requests_total{namespace="production"}[5m])
                )
                > 0.05
              for: 2m
              labels:
                severity: critical
              annotations:
                summary: "High error rate on {{ $labels.service }}"
                description: "Error rate is {{ $value | humanizePercentage }}"

Managing Silences During Maintenance
    # Create a silence via Alertmanager UI at http://localhost:9093
    # Or use amtool CLI:

    # Install amtool
    go install github.com/prometheus/alertmanager/cmd/amtool@latest

    # Silence all alerts in the payments-prod namespace during maintenance
    amtool silence add \
      --alertmanager.url=http://localhost:9093 \
      --comment="Planned maintenance window 02:00-04:00" \
      --duration=2h \
      namespace="payments-prod"

    # List all active silences
    amtool silence query --alertmanager.url=http://localhost:9093

    # Expire a silence early
    amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>

Troubleshooting Alertmanager
| Problem | Likely Cause | Fix |
|---|---|---|
| Alerts firing but no Slack message | Receiver misconfigured | Run amtool config routes test to trace which receiver is selected |
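| Too many duplicate notifications | group_by is too fine-grained (for example, it includes pod) or grouping is disabled with ['...'] | Group on coarser labels, e.g. group_by: ['alertname', 'namespace'] on the route |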
| Alert fires and resolves repeatedly | for duration too short | Increase the for: duration on the alert rule (e.g. for: 5m) to require a sustained condition |
| Silence not working | Labels don't match alert labels | Check alert labels exactly with amtool alert query |
| PagerDuty not receiving | Wrong integration key | Verify service key in receiver config matches PagerDuty service |
    # Check Alertmanager logs for routing and delivery errors
    kubectl logs -n monitoring \
      alertmanager-kube-prometheus-stack-alertmanager-0 \
      -c alertmanager --tail=50

    # View all currently firing alerts
    kubectl port-forward svc/alertmanager-operated 9093:9093 -n monitoring
    # Open http://localhost:9093/#/alerts

    # Test your routing configuration
    amtool config routes test \
      --alertmanager.url=http://localhost:9093 \
      alertname=HighErrorRate namespace=production severity=critical
    # Shows which receiver would handle this alert

**Common Mistake:** Leaving grouping unreviewed instead of configuring it explicitly. If alerts are grouped too finely (for example, `group_by` includes `pod`, or is set to `['...']`, which disables grouping), a single node failure that takes down 30 pods sends 30 separate Slack messages at once and floods the channel. Set `group_by: ['alertname', 'namespace']` and `group_wait: 30s` explicitly so related alerts are batched into a single notification.
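To see how much the choice of grouping labels matters, the sketch below counts how many groups (and so how many notifications) 30 pod alerts collapse into. It only imitates the grouping idea with `awk`; the alert name, pod names, and column layout are made up for illustration and nothing here talks to Alertmanager.

```shell
#!/bin/sh
# Emit 30 hypothetical alerts, one per pod, as "alertname namespace pod" lines.
generate_alerts() {
  i=1
  while [ "$i" -le 30 ]; do
    echo "PodNotReady production web-$i"
    i=$((i + 1))
  done
}

# Count distinct groups for a comma-separated list of 1-based column
# numbers; the columns stand in for the labels listed in group_by.
count_groups() {
  cols="$1"
  awk -v cols="$cols" '
    {
      n = split(cols, c, ",")
      key = ""
      for (j = 1; j <= n; j++) key = key $(c[j]) "|"
      seen[key] = 1
    }
    END { count = 0; for (k in seen) count++; print count }
  '
}

generate_alerts | count_groups 1,2     # grouped by alertname, namespace
generate_alerts | count_groups 1,2,3   # grouped by alertname, namespace, pod
```

Grouping on alertname and namespace yields a single group, while adding the pod column yields one group per pod, which is the flood described above.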
**Tip:** Write alert annotations that include the exact kubectl command engineers need to start debugging. The `description` field should say `kubectl logs {{ $labels.pod }} -n {{ $labels.namespace }} --previous` rather than just `pod is crashing`. At 3am during a Zerodha production incident, engineers should be able to copy the command from the Slack notification and run it immediately without thinking.
**Remember:** Alertmanager deduplicates alerts by their label set. Prometheus keeps re-sending a firing alert with identical labels for as long as its condition stays true, yet Alertmanager sends only ONE notification for it. This is intentional. The `repeat_interval` setting controls when a reminder notification is sent for alerts that are still unresolved.
**Security:** Never put Slack webhook URLs, PagerDuty keys, or any other credentials in an Alertmanager ConfigMap, or in any manifest that gets committed to Git. Store them in a Kubernetes Secret and have the Alertmanager configuration read them from there. On a shared Razorpay cluster, the Alertmanager ConfigMap is visible to anyone with `kubectl get configmap` access, so credentials must live in Secrets, not ConfigMaps.
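One way to keep credentials out of Git is to build the Secret from a local, git-ignored file at deploy time instead of committing a manifest. The sketch below assumes the Secret name and namespace used earlier in this article and assumes the Secret is not also managed through Helm values, which could overwrite it on the next upgrade. The function name is hypothetical.

```shell
#!/bin/sh
# Render a Secret manifest from a local alertmanager.yaml and apply it.
# Argument: path to the config file (defaults to ./alertmanager.yaml).
apply_alertmanager_secret() {
  config_file="${1:-alertmanager.yaml}"
  kubectl create secret generic alertmanager-kube-prometheus-stack-alertmanager \
    --namespace monitoring \
    --from-file=alertmanager.yaml="$config_file" \
    --dry-run=client -o yaml |
    kubectl apply -f -
}

# Usage (not run here, it needs cluster access):
#   apply_alertmanager_secret ./alertmanager.yaml
```

Piping `--dry-run=client -o yaml` into `kubectl apply` makes the command safe to re-run, since it updates the Secret if it already exists.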