Progressive Delivery: Canary Deployments with Argo Rollouts and Flagger

Progressive delivery lets you ship to 5% of users first and roll back in 30 seconds if something breaks — here is how to implement canary deployments with Argo Rollouts and Flagger on Kubernetes.

Status: DRAFT

The traditional deployment model has a binary risk profile: either zero users see the new code, or all users do. The moment you flip that switch, any bug, performance regression, or configuration mistake hits your entire user base simultaneously.

A Swiggy deployment at peak dinner time that breaks the order summary screen is not a minor event — it affects hundreds of thousands of active sessions instantly. A platform that ships bad code to 5% of users for five minutes before auto-rolling back has a completely different risk profile.

This is progressive delivery: gradual, automated, metric-driven rollout of new versions.

What Progressive Delivery Actually Means

Progressive delivery is the umbrella term for deployment strategies that release changes incrementally and use real production metrics to decide whether to continue, pause, or roll back automatically.

The three most common strategies:

Canary deployment: route a small percentage of traffic (5-10%) to the new version. Monitor error rate, latency, and business metrics. If metrics stay healthy, gradually increase traffic percentage. If metrics degrade, roll back automatically.

Blue-green deployment: run two identical environments, flip all traffic at once, but keep the old environment live for instant rollback.

Feature flags: release the code to all users but hide the feature behind a flag, enabling per-user or per-cohort gradual rollout.

Canary deployments are the most nuanced and the most powerful — they give you real production traffic testing without the all-or-nothing risk.

How a Canary Rollout Works

◈ DIAGRAM

v1 (stable) receives 95% of traffic
v2 (canary)  receives  5% of traffic
      |
      | Monitor for 10 minutes
      |
Error rate OK? Yes  -> v2 gets 30%
Error rate OK? Yes  -> v2 gets 60%
Error rate OK? Yes  -> v2 gets 100% (rollout complete)
      |
Error rate high? -> auto-rollback to v1 (all traffic)

The automation is what makes this production-grade. A manually managed canary, where a human has to check metrics and adjust traffic percentages every ten minutes, is too slow and too unreliable. Argo Rollouts and Flagger both automate this analysis loop.

Option 1: Argo Rollouts

Argo Rollouts is a Kubernetes controller that replaces the standard Deployment object with a Rollout CRD that supports canary and blue-green strategies natively.

Bash

## Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
 
## Install the CLI
brew install argoproj/tap/kubectl-argo-rollouts

Here is a Rollout manifest with a canary strategy tied to automated analysis:

YAML

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: your-org/checkout-api:v2.1.0
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
  strategy:
    canary:
      steps:
        - setWeight: 5   ## send 5% of traffic to canary
        - pause:
            duration: 10m  ## wait 10 minutes
        - analysis:        ## run automated metric check
            templates:
              - templateName: success-rate-check
        - setWeight: 30
        - pause:
            duration: 10m
        - setWeight: 60
        - pause:
            duration: 10m
        - setWeight: 100
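      canaryService: checkout-api-canary
      stableService: checkout-api-stable
      trafficRouting:   ## without this, weights are only approximated via replica counts
        nginx:
          stableIngress: checkout-api  ## assumed name of the Ingress that fronts the stable Service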
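The Rollout refers to two Services by name, and Argo Rollouts expects both to exist before a rollout starts. A minimal sketch of that pair follows. The port numbers are assumptions (the manifest above does not declare a container port), so adjust them to your application.

```yaml
## Sketch: the two Services named in the Rollout above.
## Both select the same pods; during a rollout the controller narrows each
## selector to the stable or the canary ReplicaSet.
apiVersion: v1
kind: Service
metadata:
  name: checkout-api-stable
  namespace: production
spec:
  selector:
    app: checkout-api
  ports:
    - port: 80          ## assumed Service port
      targetPort: 8080  ## assumed container port
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-api-canary
  namespace: production
spec:
  selector:
    app: checkout-api
  ports:
    - port: 80
      targetPort: 8080
```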

The analysis step is what makes this automatic. It queries Prometheus, compares the canary's success rate and latency against the thresholds in the template, and either lets the rollout continue or aborts it and returns traffic to the stable version.

The Analysis Template: Automated Metric Checking

YAML

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
  namespace: production
spec:
  metrics:
    - name: success-rate
      interval: 2m
      count: 5  ## take 5 measurements, then finish; with an interval but no count the run never completes
      successCondition: result[0] >= 0.95  ## 95% success rate required
      failureLimit: 3  ## tolerate up to 3 failed measurements; a 4th fails the analysis
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(
              http_requests_total{
                app="checkout-api",
                version="canary",
                status!~"5.."
              }[2m]
            ))
            /
            sum(rate(
              http_requests_total{
                app="checkout-api",
                version="canary"
              }[2m]
            ))
    - name: latency-p99
      interval: 2m
      count: 5
      successCondition: result[0] <= 0.5  ## p99 under 500ms
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum by (le) (
                rate(
                  http_request_duration_seconds_bucket{
                    app="checkout-api",
                    version="canary"
                  }[2m]
                )
              )
            )

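If more than three success-rate measurements fall below 95%, or any P99 latency measurement exceeds 500 ms (failureLimit defaults to 0 when it is not set), the analysis fails. Argo Rollouts then aborts the rollout and shifts all traffic back to the stable version, with no human intervention needed at 3 AM.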
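The Rollout shown earlier runs its analysis as one step after the first pause. If you want the metrics watched while the later weight steps run as well, Argo Rollouts also supports background analysis declared under strategy.canary.analysis. The fragment below is a sketch, not a drop-in replacement; the template name is hypothetical and stands for a template that omits count so it keeps measuring until the rollout finishes.

```yaml
## Sketch: fragment of the Rollout spec using background analysis
strategy:
  canary:
    canaryService: checkout-api-canary
    stableService: checkout-api-stable
    analysis:
      templates:
        - templateName: success-rate-background  ## hypothetical template without `count`
      startingStep: 1  ## step indexes start at 0, so analysis begins at the first pause
    steps:
      - setWeight: 5
      - pause:
          duration: 10m
      - setWeight: 30
      - pause:
          duration: 10m
      - setWeight: 60
      - pause:
          duration: 10m
```

A failed background run aborts the rollout in the same way as a failed analysis step, so the thresholds apply for the whole progression rather than at a single checkpoint.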

Option 2: Flagger

Flagger takes a different approach. Instead of replacing the Deployment, it treats your existing Deployment as the canary target, creates a checkout-api-primary copy that serves stable traffic, and shifts traffic between the two. You keep your Deployment manifest as it is and add a Canary resource that points at it.

Bash

## Install Flagger for Nginx ingress
helm repo add flagger https://flagger.app
helm upgrade -i flagger flagger/flagger \
  --namespace flagger-system \
  --create-namespace \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus.monitoring:9090

Flagger configuration wraps your existing Deployment:

YAML

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout-api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
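    name: checkout-api  ## your existing Deployment
  ingressRef:           ## required with the nginx provider: the Ingress Flagger clones for the canary
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: checkout-api  ## assumed name of your existing Ingress
  progressDeadlineSeconds: 600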
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 2m       ## check metrics every 2 minutes
    threshold: 5       ## max 5 failed checks before rollback
    maxWeight: 50      ## max traffic to canary: 50%
    stepWeight: 10     ## increase by 10% each step
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99  ## 99% success rate required
        interval: 2m
      - name: request-duration
        thresholdRange:
          max: 500  ## p99 under 500ms
        interval: 2m

When you update the container image in your Deployment, Flagger detects the change and automatically starts a canary analysis. The Deployment manifest itself does not change; the Canary resource is the only new object you manage.
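request-success-rate and request-duration are Flagger's built-in checks. For anything else, Flagger can evaluate a custom Prometheus query through a MetricTemplate. The sketch below is illustrative only: the template name is hypothetical, and the namespace and app label names are assumptions about how your metrics are scraped.

```yaml
## Sketch: a custom Flagger metric backed by Prometheus
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: checkout-error-rate   ## hypothetical name
  namespace: production
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  ## Error percentage over the analysis interval.
  ## {{ namespace }}, {{ target }} and {{ interval }} are filled in by Flagger.
  query: |
    100 - sum(rate(http_requests_total{namespace="{{ namespace }}", app="{{ target }}", status!~"5.."}[{{ interval }}]))
    /
    sum(rate(http_requests_total{namespace="{{ namespace }}", app="{{ target }}"}[{{ interval }}]))
    * 100
---
## Fragment of the Canary spec that references the template
analysis:
  metrics:
    - name: checkout-error-rate
      templateRef:
        name: checkout-error-rate
        namespace: production
      thresholdRange:
        max: 1   ## assumed limit: at most 1% errors
      interval: 2m
```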

Choosing Between Argo Rollouts and Flagger

Factor	Argo Rollouts	Flagger
Replaces Deployment?	Yes (Rollout CRD)	No (wraps existing)
GitOps integration	Native with Argo CD	Works with any GitOps
UI visibility	Argo CD dashboard	Prometheus/Grafana
Traffic control	Fine-grained steps	Percentage-based steps
Mesh/ingress support	Istio, Nginx, AWS ALB, others	Istio, Linkerd, Nginx, others

Pick Argo Rollouts if you are already running Argo CD: the integration is native and the UI shows canary progress alongside your other application state. Pick Flagger if you want to preserve existing Deployment objects and GitOps tooling.

Traffic Splitting with Nginx Ingress

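With Nginx, both tools split traffic through canary annotations on a second Ingress. The controller creates that Ingress and updates its weight for you, so you normally never write it by hand. What it manages looks roughly like this: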

YAML

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-api
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  ## 10% to canary
spec:
  rules:
    - host: checkout.internal.yourplatform.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-api-canary
                port:
                  number: 80
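A canary Ingress only takes effect next to a regular Ingress for the same host and path. A sketch of that counterpart follows, assuming the stable Service name from the Rollout example and an IngressClass called nginx. The canary Ingress needs a different name in the same namespace.

```yaml
## Sketch: the regular Ingress that the canary Ingress pairs with
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-api
  namespace: production
spec:
  ingressClassName: nginx   ## assumed IngressClass name
  rules:
    - host: checkout.internal.yourplatform.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-api-stable
                port:
                  number: 80
```

Requests that the canary weight does not select continue to flow through this Ingress to the stable Service.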

Production Implementation Guidelines

Always pair canary deployments with business metrics, not just infrastructure metrics. HTTP error rate tells you the API is broken. It does not tell you that checkout completion rate dropped 2% because of a subtle UI regression. Wire your business metrics, such as orders_completed_total and payment_success_rate, into your analysis templates alongside technical metrics.
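As an illustration, a business-level check can be written as one more Prometheus metric in an Argo Rollouts AnalysisTemplate. This is a sketch under assumptions: the template name, the checkout_started_total counter, the labels, and the 0.90 threshold are all hypothetical and should come from your own instrumentation and baseline.

```yaml
## Sketch: checkout completion rate as an analysis metric
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-completion-check   ## hypothetical name
  namespace: production
spec:
  metrics:
    - name: checkout-completion-rate
      interval: 5m
      count: 6          ## six measurements, roughly 25 minutes in total
      failureLimit: 1   ## tolerate one noisy measurement
      successCondition: result[0] >= 0.90   ## assumed threshold
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(orders_completed_total{app="checkout-api", version="canary"}[5m]))
            /
            sum(rate(checkout_started_total{app="checkout-api", version="canary"}[5m]))
```

Business metrics tend to be noisier than error rates at low traffic, which is why this sketch uses a longer window and a non-zero failureLimit.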

Start with a longer canary duration than you think you need. A 10-minute canary is fine for catching hard errors. Catching performance regressions that only appear under sustained load requires 30-60 minutes. Zerodha's trading platform, for example, needs canary analysis to cover the market-open traffic spike window — a 10-minute canary that runs at 9:05 AM IST will not see 9:15 AM traffic patterns.

Set a clear rollback condition and trust it. If your analysis template says "roll back on > 1% error rate," and the canary hits 1.1%, roll it back — even if you have a hunch it might recover. Manual overrides of automatic rollbacks erode trust in the automation and lead teams to disable it entirely.


Frequently Asked Questions

How do you configure Argo Rollouts canary analysis to account for traffic volume too low to be statistically significant during off-peak hours?

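Argo Rollouts has no dedicated minimum-request-count field on AnalysisTemplate, so build the traffic floor into the measurement itself: write the Prometheus query, or the successCondition, so that a window with too few requests does not count as a failure. If you set both successCondition and failureCondition, results that satisfy neither are marked inconclusive, which pauses the rollout for a human decision instead of rolling back. Use a longer rate window and analysis interval during off-peak hours, and hold the canary at a low weight with longer pause steps so it is not promoted to high traffic percentages during low-signal periods.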
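A traffic floor can be expressed directly in the Prometheus query. The sketch below returns the real success rate only when the canary receives more than one request per second, and otherwise returns 1 so the measurement passes. The metric name, the 10-minute window, and the floor of 1 req/s are assumptions to tune.

```yaml
## Sketch: success-rate metric that passes automatically below a traffic floor
metrics:
  - name: success-rate-with-traffic-floor   ## hypothetical name
    interval: 5m
    count: 6
    failureLimit: 2
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          (
            sum(rate(http_requests_total{app="checkout-api", version="canary", status!~"5.."}[10m]))
            /
            sum(rate(http_requests_total{app="checkout-api", version="canary"}[10m]))
          )
          and on()
          (sum(rate(http_requests_total{app="checkout-api", version="canary"}[10m])) > 1)
          or on()
          vector(1)
```

The trade-off is that a quiet window counts as a pass, so a canary can advance on little evidence. Pairing this with longer pause steps at low weights limits that risk.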

Why does Flagger rollback fail to restore stable traffic when using Nginx ingress with canary weight annotations instead of a service mesh?

Nginx canary weight annotations are eventually consistent — the ingress controller reconciles them asynchronously. During a rapid Flagger rollback, there is a window where the annotation is updated but Nginx has not yet reloaded. Mitigate by increasing Flagger's rollback timeout and setting nginx.ingress.kubernetes.io/proxy-next-upstream to handle upstream errors gracefully during the transition window.
