Understanding Argo Rollouts
What Is Argo Rollouts in Simple Terms
Argo Rollouts gives Kubernetes Deployments a brain. A normal Kubernetes Deployment does a rolling update: it gradually replaces old pods with new ones, and if the new pods crash or never become ready, the update stalls. But it cannot check whether the new version has elevated error rates, increased latency, or degraded business metrics. It only checks whether pods are running and ready.
Argo Rollouts adds that intelligence. It sends 10% of traffic to the new version, then checks Prometheus for error rates over the next 5 minutes. If the error rate stays below 1%, it promotes the canary to 25%, then 50%, then 100%. If the error rate spikes, it rolls back automatically. The happy path needs no human, and the failure path has an automatic safety net.
How It Works

    +------------------------------------------+
    | New version deployed to canary (10%)     |
    | 90% traffic still on stable version      |
    +------------------------------------------+
                         |
                   analysis step
         check Prometheus: error_rate < 1%?
                   /            \
             yes                    no
              |                      |
              v                      v
     +------------------+  +------------------+
     | Promote to 25%   |  | ABORT            |
     | continue canary  |  | Rollback to      |
     | analysis         |  | stable version   |
     +------------------+  +------------------+
              |
              v
    +------------------------------------------+
    | Promote to 50% -> 100%                   |
    | All traffic on new version               |
    | Stable version scaled down               |
    +------------------------------------------+

Rollout manifest with canary strategy:

    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata:
      name: payment-api
    spec:
      replicas: 10
      selector:
        matchLabels:
          app: payment-api
      template:
        metadata:
          labels:
            app: payment-api
        spec:
          containers:
            - name: payment-api
              image: payment-api:v1.2.3
      strategy:
        canary:
          # Canary steps with analysis
          steps:
            - setWeight: 10              # 10% of traffic to canary
            - analysis:
                templates:
                  - templateName: payment-api-error-rate
                args:
                  - name: service-name   # fills {{args.service-name}} in the query
                    value: payment-api-canary
            - setWeight: 25
            - pause: {duration: 5m}      # wait 5 minutes
            - setWeight: 50
            - pause: {duration: 5m}
            - setWeight: 100
    ---
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: payment-api-error-rate
    spec:
      args:
        - name: service-name
      metrics:
        - name: error-rate
          interval: 1m
          count: 5                               # measure 5 times
          successCondition: result[0] < 0.01     # below 1% error rate
          failureLimit: 2                        # tolerate up to 2 failed measurements; a 3rd fails the analysis
          provider:
            prometheus:
              address: http://prometheus.monitoring:9090
              query: |
                sum(rate(http_requests_total{service="{{args.service-name}}", status=~"5.."}[2m]))
                /
                sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))

Argo Rollouts kubectl plugin:

    # Install plugin
    curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
    chmod +x kubectl-argo-rollouts-linux-amd64
    sudo mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

    # Watch rollout progress live
    kubectl argo rollouts get rollout payment-api --watch

    # Manually promote a paused canary
    kubectl argo rollouts promote payment-api

    # Abort a rollout (triggers rollback)
    kubectl argo rollouts abort payment-api

    # Manually retry an aborted rollout
    kubectl argo rollouts retry rollout payment-api

    # Set image to trigger a new rollout
    kubectl argo rollouts set image payment-api \
      payment-api=payment-api:v1.2.4

Troubleshooting
| Symptom | Check | What to Look For |
|---|---|---|
| Analysis always failing | Check Prometheus query | Query returning wrong metric or no data |
| Rollout stuck at pause | Check analysis results | Manual promotion may be needed |
| Canary not receiving traffic | Check service selector | Canary service label matching |
| Old pods not scaling down | Check HPA configuration | HPA may conflict with rollout replicas |
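
For the "Canary not receiving traffic" row, it can help to see what the Services might look like. The sketch below is an illustration only: the Service names `payment-api-canary` and `payment-api-stable`, and the port numbers, are assumptions that do not appear in the manifest above. The point it shows is that each Service's selector has to match the labels on the Rollout's pod template, and that the Rollout has to name the Services through `canaryService` and `stableService`.

```yaml
# Hypothetical Services for illustration; names and ports are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: payment-api-canary
spec:
  selector:
    app: payment-api          # must match the Rollout's pod template labels
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: payment-api-stable
spec:
  selector:
    app: payment-api          # same base label as the canary Service
  ports:
    - port: 80
      targetPort: 8080
---
# Fragment of the Rollout spec that points at the two Services.
spec:
  strategy:
    canary:
      canaryService: payment-api-canary
      stableService: payment-api-stable
```

At runtime the controller adds a `rollouts-pod-template-hash` entry to each Service's selector so that one Service targets only canary pods and the other only stable pods. If the base selector does not match the pod labels, the canary Service ends up with no endpoints.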
**Tip:** Start with a simple canary without analysis: just `setWeight` steps and manual `pause` periods. Get comfortable with the rollout lifecycle before adding automated Prometheus analysis. A manually controlled canary is far safer than a broken automated analysis that always passes.
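
A manually gated canary along these lines might look like the fragment below. It is a minimal sketch, not a recommended production configuration: the weights and the 10-minute duration are arbitrary choices for illustration. A `pause` with no duration holds the rollout until someone runs `kubectl argo rollouts promote payment-api`.

```yaml
# Fragment of a Rollout spec: canary steps with no automated analysis.
# Weights and durations are illustrative assumptions.
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {}                # no duration: waits for manual promotion
        - setWeight: 50
        - pause: {duration: 10m}   # timed pause, then continues on its own
        - setWeight: 100
```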
**Remember:** Argo Rollouts replaces the Kubernetes Deployment resource: you use `kind: Rollout` instead of `kind: Deployment`. Existing Deployments can be converted, but this requires a migration step. For new services, deploy as Rollouts from the start.
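
One possible migration path is to have the Rollout reference the existing Deployment's pod template through `workloadRef` instead of copying the template into the Rollout. The sketch below assumes a Deployment named `payment-api` already exists in the same namespace; the canary steps are placeholders.

```yaml
# Sketch: a Rollout that reuses the pod template of an existing Deployment.
# Assumes a Deployment named payment-api exists in the same namespace.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-api
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  strategy:
    canary:
      steps:                       # placeholder steps for illustration
        - setWeight: 10
        - pause: {}
```

Once the Rollout's pods are healthy, the original Deployment is typically scaled down to zero replicas so that the two controllers do not run duplicate pods.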
**Warning:** Analysis templates that query Prometheus must be written and tested carefully. A PromQL query that returns no data (an empty result) does not necessarily fail the analysis. If the `successCondition` treats an empty result as acceptable, the analysis passes with nothing measured, meaning broken Prometheus monitoring silently disables your safety gate. Always test analysis templates with synthetic load before relying on them in production.
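
One way to make the gate explicit about missing data is to require a non-empty result in the condition itself. The fragment below is a sketch of that idea applied to the `error-rate` metric from the template above. It is not the only way to handle empty results, and whether missing data should fail a rollout is a policy choice for your team.

```yaml
# Fragment of an AnalysisTemplate: an empty query result counts as a
# failed measurement instead of being left to default behavior.
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      # len(result) > 0 guards the index; no data means the condition is false
      successCondition: len(result) > 0 && result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}", status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```

With this ratio query, a canary that has served no 5xx responses at all can also produce an empty result, because the numerator matches no series. Test the template against a healthy, low-traffic canary as well as a failing one.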
**Common Mistake:** Setting the canary weight too high for the first step. Starting at 50% canary means half your production traffic is on an untested version. Start at 5-10% for the first canary step. The purpose of the initial step is to expose the new version to a small, representative traffic sample, not to share the load immediately.