PromQL query reference — selectors, rate functions, aggregations, alerting rules, and the queries engineers actually run during incidents.
Pull raw time series before doing any math on them.
up ## 1 if target is healthy, 0 if down
up{job="orders-api"} ## scoped to one job
http_requests_total{job="orders-api", status="500"}
http_requests_total{job=~"orders-.*"} ## regex match on label value
http_requests_total{job!="orders-api"} ## negative match
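Matchers can be combined inside one selector, and regex matchers are fully anchored, so a pattern has to match the whole label value rather than a substring. A small sketch, assuming the same http_requests_total metric and that status holds a three-digit HTTP code:

```promql
## every orders-* job, excluding 2xx responses
http_requests_total{job=~"orders-.*", status!~"2.."}
## anchored regex: matches "orders-api" but not "legacy-orders-api"
http_requests_total{job=~"orders-.*"}
```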
Parameter Breakdown:
- {label="value"}: Exact match filter on a metric's labels
- =~: Regex match, useful for matching a family of related services
- != / !~: Negative exact match and negative regex match

The functions you'll use in nearly every dashboard panel.
http_requests_total[5m] ## raw samples over the last 5 minutes
rate(http_requests_total[5m]) ## per-second average rate, counters
irate(http_requests_total[5m]) ## instant rate, last two data points only
increase(http_requests_total[1h]) ## total increase over the window
delta(cpu_temp_celsius[10m]) ## change over time, for gauges not counters
Command Parameter Table:
| Function | Use Case |
|---|---|
| rate() | Smoothed per-second rate; best for alerting and dashboards |
| irate() | Spiky, reacts fast; best for short-term debugging, not alerting |
| increase() | Total count over a window; best for "how many errors in the last hour" |
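The relationship between rate() and increase() can be made concrete: increase() is rate() multiplied by the number of seconds in the window, so the two queries below should return the same values. Both extrapolate to the edges of the window, which is why increase() can return non-integer results even for a counter that only ever goes up by whole numbers.

```promql
## total requests over the last hour
increase(http_requests_total[1h])
## same result expressed through rate(): 1h = 3600 seconds
rate(http_requests_total[1h]) * 3600
```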
Roll many time series into one meaningful number.
sum(rate(http_requests_total[5m]))
sum(rate(http_requests_total[5m])) by (job)
sum(rate(http_requests_total[5m])) by (job, status)
avg(node_memory_MemAvailable_bytes) by (instance) ## Linux node_exporter metric name
max(rate(node_cpu_seconds_total[5m])) by (instance) ## counter, so take rate() before aggregating
count(up == 0) ## how many targets are currently down
topk(5, sum(rate(http_requests_total[5m])) by (job))
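As a complement to by, the without modifier drops only the listed labels and keeps everything else, and bottomk mirrors topk. A sketch, assuming each series carries an instance label you want to aggregate away:

```promql
## sum across instances, keeping job, status, and any other labels
sum without (instance) (rate(http_requests_total[5m]))
## the 3 quietest jobs, the mirror image of topk
bottomk(3, sum(rate(http_requests_total[5m])) by (job))
```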
Parameter Breakdown:
- by (label): Groups the aggregation, keeping that label instead of collapsing everything
- topk(N, ...): Returns the N highest values, useful for "which services have the most traffic"
- count(up == 0): A common alerting building block for "how many targets are unhealthy"

Patterns pulled from node_exporter and cAdvisor metrics.
## CPU usage percentage per instance
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
## Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
## Disk usage percentage per mount point
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
## Container memory usage (cAdvisor)
sum(container_memory_usage_bytes{pod="orders-api-7d4b"}) by (pod)
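The disk usage query above returns a series for every mounted filesystem, including pseudo-filesystems. A hedged variant, assuming node_exporter's fstype label and that tmpfs and overlay mounts are the ones you want to ignore:

```promql
## disk usage percentage, skipping tmpfs and overlay mounts
100 - (
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  /
  node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
  * 100
)
```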
Notes:
- node_cpu_seconds_total is a counter, so always wrap it in rate() or irate() instead of using the raw value

Queries that depend on kube-state-metrics being scraped.
## Pods not in Running phase
kube_pod_status_phase{phase!="Running"} == 1
## Pod restart count over the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0
## Deployment desired vs available replica mismatch
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
## Node memory pressure condition
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
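To see which workload a result belongs to, the same patterns can be grouped by the labels kube-state-metrics attaches. A sketch, assuming the usual namespace, pod, and deployment labels are present:

```promql
## restarts in the last hour, one series per pod
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h])) > 0
## number of missing replicas per deployment
sum by (namespace, deployment) (kube_deployment_spec_replicas - kube_deployment_status_replicas_available) > 0
```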
Parameter Breakdown:
- kube-state-metrics: Required in addition to node_exporter for any pod- or deployment-level query
- != kube_deployment_status_replicas_available: The standard pattern for "deployment is degraded"
- Add by (namespace, deployment) in real dashboards to scope results

Define when Prometheus should fire an alert.
groups:
  - name: orders-api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="orders-api",status="500"}[5m]))
          /
          sum(rate(http_requests_total{job="orders-api"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Orders API error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"
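While a rule is waiting out its for duration it is pending rather than firing, and Prometheus exposes both states through the synthetic ALERTS series. A sketch for checking the HighErrorRate rule from the expression browser:

```promql
## condition is true but the for: 5m window has not elapsed yet
ALERTS{alertname="HighErrorRate", alertstate="pending"}
## alert is active and is being sent to Alertmanager
ALERTS{alertname="HighErrorRate", alertstate="firing"}
## number of firing alerts per severity label
count by (severity) (ALERTS{alertstate="firing"})
```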
Parameter Breakdown:
- for: 5m: The condition must stay true for this entire duration before the alert actually fires
- labels.severity: Routes the alert to the right Alertmanager receiver
- {{ $value }}: Template variable injecting the evaluated query result into the alert message

Precompute expensive queries so dashboards stay fast.
groups:
  - name: orders-api-recording-rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status="500"}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)
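Once recorded, the precomputed series can stand in for the full expression wherever it is needed. A sketch reusing the 5% threshold from the HighErrorRate alerting rule:

```promql
## error ratio for one job, read from the precomputed series
job:http_error_rate:ratio5m{job="orders-api"}
## same 5% threshold as HighErrorRate, without recomputing rate()
job:http_error_rate:ratio5m{job="orders-api"} > 0.05
## busiest jobs, taken from the recorded request rate
topk(5, job:http_requests:rate5m)
```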
Parameter Breakdown:
- Naming convention: level:metric:operation (e.g. job:http_requests:rate5m) keeps recorded metrics self-documenting
- Query a recorded series like any other metric: job:http_error_rate:ratio5m{job="orders-api"}

Inspect and silence alerts from the command line.
amtool alert query ## list current active alerts
amtool alert query alertname="HighErrorRate"
amtool silence add alertname="HighErrorRate" \
--duration=2h --comment="Known issue, fix in progress"
amtool silence query ## list active silences
amtool silence expire <silence-id> ## end a silence early
amtool config show ## dump active Alertmanager config
Parameter Breakdown:
- silence add --duration: Suppresses notifications without disabling the alert rule itself
- --comment: Required and shown in the UI, so always explain why a silence exists
- silence expire: Use once the underlying issue is actually fixed; don't let silences linger

Functions that come up constantly once you're past basic queries.
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
label_replace(up, "service", "$1", "job", "(.*)-api")
absent(up{job="orders-api"}) ## true if the target has vanished entirely
Command Parameter Table:
| Function | Description |
|---|---|
| histogram_quantile() | Computes a percentile (e.g. p95 latency) from a histogram metric |
| predict_linear() | Forecasts a value N seconds ahead based on recent trend; great for disk-full alerts |
| absent() | True when a metric or target stops reporting entirely, not just hits zero |
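Two details are easy to get wrong with these functions: histogram_quantile() needs the le label to survive aggregation, and the second argument of predict_linear() is a number of seconds. A sketch, assuming the histogram carries a job label:

```promql
## p95 latency per job: keep le alongside any grouping label
histogram_quantile(0.95,
  sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
## disk predicted to fill within 24 hours (24 * 3600 = 86400 seconds)
predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
## returns 1 only when no up series exists for the job
absent(up{job="orders-api"}) == 1
```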