What is Argo Rollouts? | DevOps Dictionary

Understanding Argo Rollouts

What Is Argo Rollouts in Simple Terms

Argo Rollouts gives Kubernetes Deployments a brain. A normal Kubernetes Deployment does a rolling update: it gradually replaces old pods with new ones, and if the new pods crash or never become ready, the update stalls. But it cannot check whether the new version has elevated error rates, increased latency, or degraded business metrics. It only looks at pod health, that is, whether pods are running and passing their readiness checks.

Argo Rollouts adds that intelligence. For example: deploy the new version to 10% of traffic and check Prometheus for error rates over the next 5 minutes. If the error rate stays below 1%, promote to 25%, then 50%, then 100%. If the error rate spikes, roll back automatically. The happy path needs no human, and the failure path has an automatic safety net.

How It Works

◈ DIAGRAM

+------------------------------------------+
| New version deployed to canary (10%)     |
| 90% traffic still on stable version      |
+------------------------------------------+
                    |
              analysis step
              check Prometheus:
              error_rate < 1%?
             /              \
           yes               no
            |                 |
            v                 v
+------------------+  +------------------+
| Promote to 25%   |  | ABORT            |
| continue canary  |  | Rollback to      |
| analysis         |  | stable version   |
+------------------+  +------------------+
            |
            v
+------------------------------------------+
| Promote to 50% -> 100%                   |
| All traffic on new version               |
| Stable version scaled down               |
+------------------------------------------+

Rollout manifest with canary strategy:

YAML

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: payment-api:v1.2.3
 
  strategy:
    canary:
      ## Canary steps with analysis
      steps:
        - setWeight: 10      ## ~10% to canary (approximated by pod count unless traffic routing is configured)
        - analysis:
            templates:
              - templateName: payment-api-error-rate
            args:
              - name: service-name
                value: payment-api-canary
        - setWeight: 25
        - pause: {duration: 5m}  ## wait 5 minutes
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
 
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-api-error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 1m
      count: 5       ## measure 5 times
      successCondition: result[0] < 0.01  ## below 1% error rate
      failureLimit: 2  ## tolerate up to 2 failed measurements; a 3rd fails the analysis
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",
            status=~"5.."}[2m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
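
The analysis step passes `payment-api-canary` as the `service-name` argument, but the Rollout manifest does not define a Service with that name. One way this could be wired up is sketched below. The `payment-api-stable` name, the ports, and the assumption that the Prometheus `service` label matches the Kubernetes Service name are all illustrative, not taken from the manifest above.

```yaml
## Hypothetical Services for the stable and canary pods (names and ports are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: payment-api-stable
spec:
  selector:
    app: payment-api      ## the Rollouts controller narrows this selector at runtime
  ports:
    - port: 80
      targetPort: 8080    ## assumed container port
---
apiVersion: v1
kind: Service
metadata:
  name: payment-api-canary
spec:
  selector:
    app: payment-api
  ports:
    - port: 80
      targetPort: 8080
---
## Fragment of the Rollout spec that references the two Services
spec:
  strategy:
    canary:
      stableService: payment-api-stable
      canaryService: payment-api-canary
```

With `stableService` and `canaryService` set, the controller points each Service at the matching ReplicaSet, so metrics labelled per Service can be separated into stable and canary.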

Argo Rollouts kubectl plugin:

Bash

## Install plugin (Linux amd64 binary)
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
 
## Watch rollout progress live
kubectl argo rollouts get rollout payment-api --watch
 
## Manually promote a paused canary
kubectl argo rollouts promote payment-api
 
## Abort a rollout (triggers rollback)
kubectl argo rollouts abort payment-api
 
## Manually retry an aborted rollout
kubectl argo rollouts retry rollout payment-api
 
## Set image to trigger a new rollout
kubectl argo rollouts set image payment-api \
  payment-api=payment-api:v1.2.4

Troubleshooting

| Symptom | Check | What to Look For |
|---|---|---|
| Analysis always failing | Check Prometheus query | Query returning wrong metric or no data |
| Rollout stuck at pause | Check analysis results | Manual promotion may be needed |
| Canary not receiving traffic | Check service selector | Canary service label matching |
| Old pods not scaling down | Check HPA configuration | HPA may conflict with rollout replicas |
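
For the HPA row, one thing worth checking is whether the autoscaler targets the Rollout itself rather than a Deployment. A minimal sketch follows, assuming CPU-based scaling; the replica bounds and the 70% threshold are placeholder values, not recommendations.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout            ## target the Rollout, not a Deployment
    name: payment-api
  minReplicas: 5             ## assumed bounds
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   ## assumed threshold
```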

PLACEMENT PRO TIP
**Tip:** Start with a simple canary without analysis — just `setWeight` steps and manual `pause` periods. Get comfortable with the rollout lifecycle before adding automated Prometheus analysis. A manually controlled canary is far safer than a broken automated analysis that always passes.
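
A manually gated canary of this kind might look like the following fragment. The weights and the 10-minute duration are illustrative choices.

```yaml
## Fragment of a Rollout spec: canary steps with no automated analysis
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {}                ## no duration: waits until someone promotes
        - setWeight: 50
        - pause: {duration: 10m}   ## timed pause: continues automatically after 10 minutes
        - setWeight: 100
```

The indefinite `pause: {}` step is released with `kubectl argo rollouts promote payment-api`.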

REMEMBER THIS
**Remember:** Argo Rollouts replaces the Kubernetes Deployment resource — you use `kind: Rollout` instead of `kind: Deployment`. Existing Deployments can be converted, but this requires a migration step. For new services, deploy as Rollouts from the start.
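
One migration path is to have the Rollout reference the existing Deployment's pod template through `workloadRef` instead of copying the template. The sketch below assumes a Deployment named `payment-api` already exists in the same namespace; the steps are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-api
  workloadRef:               ## reuse the pod template of the existing Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api        ## assumed name of the existing Deployment
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {}
```

With this approach the original Deployment typically still needs to be scaled down to zero once the Rollout's pods are healthy, so that both controllers are not running pods side by side.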

COMMON MISTAKE / WARNING
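**Warning:** Analysis templates that query Prometheus must be written and tested carefully. A PromQL query that returns no data (an empty result) is not automatically treated as a failure: how it is scored depends on the `successCondition`, and a condition that tolerates empty results lets the analysis pass. In that case, broken Prometheus monitoring silently disables your safety gate. Handle the empty case explicitly in the condition, and always test analysis templates with synthetic load before relying on them in production.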
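
One way to keep an empty result from slipping through is to guard the condition on the length of the result. The sketch below reuses the `payment-api-error-rate` template from earlier and changes only the `successCondition`; treat it as a starting point and verify it against your own metrics.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-api-error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      ## An empty result makes the condition false, so the measurement counts as failed
      successCondition: len(result) > 0 && result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```

The guard has a trade-off: if the canary has never emitted a 5xx series, the numerator can be empty and the whole query returns no data, so a healthy canary would fail. Check how the query behaves with real metrics before relying on it.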

COMMON MISTAKE / WARNING
**Common Mistake:** Setting canary weight too high for the first step. Starting at 50% canary means half your production traffic is on an untested version. Start at 5-10% for the first canary step. The purpose of the initial step is to expose the new version to a small, representative traffic sample — not to immediately share the load.