Prometheus and PromQL - Query Quick Reference

PromQL query reference — selectors, rate functions, aggregations, alerting rules, and the queries engineers actually run during incidents.

Basic Selectors

Pull raw time series before doing any math on them.

TEXT

up                                      ## 1 if the last scrape of the target succeeded, 0 if it failed
up{job="orders-api"}                    ## scoped to one job
http_requests_total{job="orders-api", status="500"}
http_requests_total{job=~"orders-.*"}   ## regex match on label value
http_requests_total{job!="orders-api"}  ## negative match

Parameter Breakdown:

{label="value"}: Exact match filter on a metric's labels
=~: Regex match, anchored to the whole label value; useful for matching a family of related services
!= / !~: Negative exact match and negative regex match
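The four matchers can be modelled in a few lines of Python. This is only a sketch: `matches` is a hypothetical helper, not part of any Prometheus client library, and Python's `re` module stands in for the RE2 engine that Prometheus uses. It illustrates two details that are easy to miss: regex matchers must cover the whole label value, and a missing label behaves like an empty string.

```python
import re

def matches(labels, name, op, pattern):
    """Hypothetical model of one PromQL label matcher applied to one series."""
    value = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None  # anchored at both ends
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher: {op}")

series = {"job": "orders-api", "status": "500"}
print(matches(series, "job", "=~", "orders-.*"))   # True
print(matches(series, "job", "=~", "orders"))      # False: must match the whole value
print(matches(series, "job", "!=", "orders-api"))  # False
```

Because of the anchoring, job=~"orders" selects nothing here; write job=~"orders.*" to match by prefix.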

Range Vectors and Rate

The functions you'll use in nearly every dashboard panel.

TEXT

http_requests_total[5m]                 ## raw samples over the last 5 minutes
rate(http_requests_total[5m])           ## per-second average rate, counters
irate(http_requests_total[5m])          ## instant rate, last two data points only
increase(http_requests_total[1h])       ## total increase over the window
delta(cpu_temp_celsius[10m])            ## change over time, for gauges not counters

Command Parameter Table:

Function	Use Case
`rate()`	Smoothed per-second rate — best for alerting and dashboards
`irate()`	Spiky, reacts fast — best for short-term debugging, not alerting
`increase()`	Total count over a window — best for "how many errors in the last hour"
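To make the difference between the three functions concrete, here is a simplified Python model of what they compute from the raw samples in one window. The function names and sample data are made up for illustration, and the sketch assumes evenly spaced samples and skips the extrapolation to the window edges that real Prometheus performs, so actual results will differ slightly.

```python
# Samples are (timestamp_seconds, counter_value) pairs inside one range window.

def counter_increase(samples):
    """Total growth across the window, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the new value is all growth.
        total += cur - prev if cur >= prev else cur
    return total

def counter_rate(samples):
    """Average per-second rate between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed

def counter_irate(samples):
    """Per-second rate computed from the last two samples only."""
    return counter_rate(samples[-2:])

# The drop from 220 to 10 simulates a process restart.
window = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 130)]
print(counter_increase(window))  # 250.0 (60 + 60 + 10 + 120)
print(counter_rate(window))      # 250 / 240 seconds, a little over 1 per second
print(counter_irate(window))     # 2.0 (120 over the final 60 seconds)
```

The last two lines show why irate() is spikier: it sees only the most recent interval, while rate() averages over the whole window.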

Aggregation Operators

Roll many time series into one meaningful number.

TEXT

sum(rate(http_requests_total[5m]))
sum(rate(http_requests_total[5m])) by (job)
sum(rate(http_requests_total[5m])) by (job, status)
avg(node_memory_MemAvailable_bytes) by (instance)
max(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)   ## counter, so take rate() first
count(up == 0)                          ## how many targets are currently down
topk(5, sum(rate(http_requests_total[5m])) by (job))

Parameter Breakdown:

by (label): Groups the aggregation, keeping that label instead of collapsing everything
topk(N, ...): Returns the N highest values — useful for "which services have the most traffic"
count(up == 0): A common alerting building block for "how many targets are unhealthy"
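The grouping behaviour of by (...) and topk() can be sketched in Python as follows. The series data and the helper names `sum_by` and `topk` are invented for illustration; the input stands in for the instant vector that rate(...) would return.

```python
from collections import defaultdict

# Hypothetical instant vector: one (labels, value) pair per time series.
series = [
    ({"job": "orders-api", "status": "200", "instance": "a"}, 40.0),
    ({"job": "orders-api", "status": "500", "instance": "a"}, 2.0),
    ({"job": "orders-api", "status": "200", "instance": "b"}, 38.0),
    ({"job": "billing-api", "status": "200", "instance": "c"}, 12.0),
]

def sum_by(vector, keep):
    """Model of sum(...) by (keep...): drop all other labels, then add values."""
    groups = defaultdict(float)
    for labels, value in vector:
        key = tuple((name, labels.get(name, "")) for name in keep)
        groups[key] += value
    return [(dict(key), total) for key, total in groups.items()]

def topk(k, vector):
    """Model of topk(k, ...): the k series with the largest values."""
    return sorted(vector, key=lambda item: item[1], reverse=True)[:k]

by_job = sum_by(series, ["job"])
print(by_job)             # one entry per job: orders-api 80.0, billing-api 12.0
print(sum_by(series, []))  # no by clause: everything collapses to a single 92.0
print(topk(1, by_job))    # the busiest job only
```

Any label not listed in by (...) disappears from the result, which is why the status and instance labels are gone after grouping by job.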

Common CPU, Memory, and Disk Queries

Patterns pulled from node_exporter and cAdvisor metrics.

TEXT

## CPU usage percentage per instance
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

## Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

## Disk usage percentage per mount point
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

## Container memory usage (cAdvisor)
sum(container_memory_usage_bytes{pod="orders-api-7d4b"}) by (pod)

Notes:

node_cpu_seconds_total is a counter, so always wrap it in rate() or irate() rather than using the raw value
Memory queries divide available by total, then invert, because node_exporter exposes available memory, not used memory
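The arithmetic behind the CPU and memory percentages is worked through below in Python. The input numbers are invented: the idle rates stand for the per-core output of a rate over the idle-mode CPU counter, and the byte counts for the available and total memory gauges.

```python
def cpu_usage_percent(idle_rates):
    """idle_rates: per-core fraction of each second spent idle (0.0 to 1.0)."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100  # whatever is not idle counts as usage

def memory_usage_percent(available_bytes, total_bytes):
    """Invert the available share to get the used share."""
    return (1 - available_bytes / total_bytes) * 100

# Four cores, each idle for the given fraction of the time.
print(cpu_usage_percent([0.75, 0.75, 0.5, 1.0]))         # 25.0
# 2 GiB available out of 8 GiB total.
print(memory_usage_percent(2 * 1024**3, 8 * 1024**3))    # 75.0
```

Averaging the idle rate across cores first is what makes the CPU figure a per-instance percentage rather than a per-core one.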

Kubernetes-Specific Queries

Queries that depend on kube-state-metrics being scraped.

TEXT

## Pods not in Running phase
kube_pod_status_phase{phase!="Running"} == 1

## Pod restart count over the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

## Deployment desired vs available replica mismatch
kube_deployment_spec_replicas != kube_deployment_status_replicas_available

## Node memory pressure condition
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1

Parameter Breakdown:

kube-state-metrics: Must be deployed and scraped for any pod- or deployment-level query; node_exporter only covers host-level metrics
!= kube_deployment_status_replicas_available: The standard pattern for "deployment is degraded"
In real dashboards, scope these with a namespace label filter, or aggregate with by (namespace, deployment), so results stay readable

Alerting Rules Syntax

Define when Prometheus should fire an alert.

YAML

groups:
  - name: orders-api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="orders-api",status="500"}[5m]))
          /
          sum(rate(http_requests_total{job="orders-api"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Orders API error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"

Parameter Breakdown:

for: 5m: The condition must stay true for this entire duration before the alert actually fires
labels.severity: A label attached to the alert; Alertmanager routes can match on it to pick the right receiver
{{ $value }}: Template variable injecting the evaluated query result into the alert message
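The effect of the for clause is easiest to see as a small state machine. The Python sketch below is a simplified model, not Prometheus code: `alert_states` is a hypothetical function, evaluations are assumed to happen at a fixed interval, and the only states tracked are inactive, pending, and firing.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation (did expr return a result?).
    """
    states = []
    active_since = None  # time at which the condition last became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states

# Evaluated every 60s with "for: 2m"; the condition flaps once at the third check.
print(alert_states([True, True, False, True, True, True], 60, 120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

The single false evaluation resets the timer, so a flapping condition may never fire. That is the trade-off of a long for duration: fewer noisy pages, but slower detection.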

Recording Rules

Precompute expensive queries so dashboards stay fast.

YAML

groups:
  - name: orders-api-recording-rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status="500"}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

Parameter Breakdown:

Naming convention level:metric:operation (e.g. job:http_requests:rate5m) keeps recorded metrics self-documenting
Recording rules run on Prometheus's own evaluation interval, not per dashboard load
Reference a recording rule like any other metric: job:http_error_rate:ratio5m{job="orders-api"}

amtool — Alertmanager CLI

Inspect and silence alerts from the command line.

BASH

amtool alert query                      ## list current active alerts
amtool alert query alertname="HighErrorRate"
amtool silence add alertname="HighErrorRate" \
  --duration=2h --comment="Known issue, fix in progress"
amtool silence query                    ## list active silences
amtool silence expire <silence-id>      ## end a silence early
amtool config show                      ## dump active Alertmanager config

Parameter Breakdown:

silence add --duration: Suppresses notifications without disabling the alert rule itself
--comment: Required by default and shown in the UI, so always explain why a silence exists
silence expire: Use once the underlying issue is actually fixed; don't let silences linger

Useful Functions

Functions that come up constantly once you're past basic queries.

TEXT

histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0

label_replace(up, "service", "$1", "job", "(.*)-api")

absent(up{job="orders-api"})            ## true if the target has vanished entirely

Command Parameter Table:

Function	Description
`histogram_quantile()`	Computes a percentile (e.g. p95 latency) from a histogram metric
`predict_linear()`	Forecasts a value N seconds ahead based on recent trend — great for disk-full alerts
`absent()`	Returns 1 when no matching series exists, i.e. the metric or target has stopped reporting entirely, not just when its value hits zero
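For intuition about what histogram_quantile() and predict_linear() calculate, here is a simplified Python model of each. Both functions are sketches with made-up sample data: the quantile version assumes 0 < q <= 1 and cumulative buckets ending in +Inf, and the prediction version anchors the forecast at the last sample rather than at the query evaluation time as Prometheus does.

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, last bound is +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, extended forward."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    target_t = samples[-1][0] + seconds_ahead
    return mean_v + slope * (target_t - mean_t)

latency = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, latency))  # between 0.5 and 1.0, roughly 0.78

disk_avail = [(0, 1000), (3600, 900), (7200, 800)]  # losing 100 units per hour
print(predict_linear(disk_avail, 4 * 3600))    # about 400, still positive
print(predict_linear(disk_avail, 12 * 3600))   # negative: disk would be full
```

The quantile is only as precise as the bucket boundaries allow: every p95 between 0.5 and 1.0 here is an interpolation, so choose bucket bounds close to the latencies you care about.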