Progressive delivery lets you ship to 5% of users first and roll back in 30 seconds if something breaks — here is how to implement canary deployments with Argo Rollouts and Flagger on Kubernetes.
Status: DRAFT
The traditional deployment model has a binary risk profile: either zero users see the new code, or all users do. The moment you flip that switch, any bug, performance regression, or configuration mistake hits your entire user base simultaneously.
A Swiggy deployment at peak dinner time that breaks the order summary screen is not a minor event — it affects hundreds of thousands of active sessions instantly. A platform that ships bad code to 5% of users for five minutes before auto-rolling back has a completely different risk profile.
This is progressive delivery: gradual, automated, metric-driven rollout of new versions.
Progressive delivery is the umbrella term for deployment strategies that release changes incrementally and use real production metrics to decide whether to continue, pause, or roll back automatically.
The three most common strategies:
Canary deployment: route a small percentage of traffic (5-10%) to the new version. Monitor error rate, latency, and business metrics. If metrics stay healthy, gradually increase traffic percentage. If metrics degrade, roll back automatically.
Blue-green deployment: run two identical environments, flip all traffic at once, but keep the old environment live for instant rollback.
Feature flags: release the code to all users but hide the feature behind a flag, enabling per-user or per-cohort gradual rollout.
Canary deployments are the most nuanced and the most powerful — they give you real production traffic testing without the all-or-nothing risk.
```
v1 (stable) receives 95% of traffic
v2 (canary) receives 5% of traffic
        |
        |  Monitor for 10 minutes
        |
Error rate OK?   Yes -> v2 gets 30%
Error rate OK?   Yes -> v2 gets 60%
Error rate OK?   Yes -> v2 gets 100% (rollout complete)
        |
Error rate high? -> auto-rollback to v1 (all traffic)
```

The automation is what makes this production-grade. A manually managed canary that requires a human to check metrics and adjust traffic percentages every ten minutes is too slow and too unreliable. Argo Rollouts and Flagger both automate this analysis loop.
Argo Rollouts is a Kubernetes controller that replaces the standard Deployment object with a Rollout CRD that supports canary and blue-green strategies natively.
```bash
# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install the CLI
brew install argoproj/tap/kubectl-argo-rollouts
```

Here is a Rollout manifest with a canary strategy tied to automated analysis:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: your-org/checkout-api:v2.1.0
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
  strategy:
    canary:
      steps:
        - setWeight: 5        # send 5% of traffic to canary
        - pause:
            duration: 10m     # wait 10 minutes
        - analysis:           # run automated metric check
            templates:
              - templateName: success-rate-check
        - setWeight: 30
        - pause:
            duration: 10m
        - setWeight: 60
        - pause:
            duration: 10m
        - setWeight: 100
      canaryService: checkout-api-canary
      stableService: checkout-api-stable
```

The analysis step is what makes this automatic. It queries Prometheus, checks the canary's success rate and latency against the thresholds you define, and either continues the rollout or triggers a rollback.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
  namespace: production
spec:
  metrics:
    - name: success-rate
      interval: 2m
      count: 5                             # 5 measurements; with an interval but no count, an inline analysis step never finishes
      successCondition: result[0] >= 0.95  # 95% success rate required
      failureLimit: 3                      # allow 3 failures before marking as failed
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(
              http_requests_total{
                app="checkout-api",
                version="canary",
                status!~"5.."
              }[2m]
            ))
            /
            sum(rate(
              http_requests_total{
                app="checkout-api",
                version="canary"
              }[2m]
            ))
    - name: latency-p99
      interval: 2m
      count: 5
      successCondition: result[0] <= 0.5   # p99 under 500ms
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Aggregate buckets by `le` so the quantile is computed across all canary pods
          query: |
            histogram_quantile(0.99,
              sum by (le) (rate(
                http_request_duration_seconds_bucket{
                  app="checkout-api",
                  version="canary"
                }[2m]
              ))
            )
```

If the success rate falls below 95% on more than three measurements, or a P99 latency measurement exceeds 500ms during the canary phase, Argo Rollouts aborts the rollout and shifts all traffic back to the stable version, with no human intervention needed at 3 AM.
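The Rollout manifest above names `checkout-api-canary` and `checkout-api-stable`, and Argo Rollouts expects both Services to exist before the rollout starts. A minimal sketch follows, assuming the container listens on port 8080 and that an existing Ingress named `checkout-api` fronts the stable Service (both are assumptions, not details given for the Argo setup). The last document is an excerpt to merge into the Rollout, not a standalone resource.

```yaml
# Stable Service: receives production traffic. At runtime the controller adds a
# rollouts-pod-template-hash selector to pin it to the stable ReplicaSet.
apiVersion: v1
kind: Service
metadata:
  name: checkout-api-stable
  namespace: production
spec:
  selector:
    app: checkout-api
  ports:
    - port: 80
      targetPort: 8080   # assumed container port
---
# Canary Service: same selector; the controller pins it to the canary ReplicaSet.
apiVersion: v1
kind: Service
metadata:
  name: checkout-api-canary
  namespace: production
spec:
  selector:
    app: checkout-api
  ports:
    - port: 80
      targetPort: 8080   # assumed container port
---
# Excerpt only: merge into the Rollout's spec.strategy.canary
spec:
  strategy:
    canary:
      canaryService: checkout-api-canary
      stableService: checkout-api-stable
      trafficRouting:
        nginx:
          stableIngress: checkout-api   # assumed name of the existing Ingress
```

Without a `trafficRouting` block, the controller can only approximate each `setWeight` through the ratio of canary to stable pods, so with 10 replicas a 5% step cannot be matched exactly.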
Flagger takes a different approach. Instead of replacing Deployment, it references your existing Deployment, creates a `-primary` copy of it to serve stable traffic, and treats the original as the canary. This means you can adopt Flagger without changing any existing manifests.
```bash
# Install Flagger for Nginx ingress
helm repo add flagger https://flagger.app
helm upgrade -i flagger flagger/flagger \
  --namespace flagger-system \
  --create-namespace \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus.monitoring:9090
```

Flagger configuration wraps your existing Deployment:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout-api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api            # your existing Deployment
  ingressRef:                     # needed with the Nginx provider; Ingress name is assumed
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: checkout-api
  progressDeadlineSeconds: 600
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 2m                  # check metrics every 2 minutes
    threshold: 5                  # max 5 failed checks before rollback
    maxWeight: 50                 # max traffic to canary: 50%
    stepWeight: 10                # increase by 10% each step
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99                 # 99% success rate required
        interval: 2m
      - name: request-duration
        thresholdRange:
          max: 500                # p99 under 500ms
        interval: 2m
```

When you update the container image in your Deployment, Flagger detects the change and automatically starts a canary analysis. The Canary resource is the only addition; your existing Deployment manifest stays as it is.
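Flagger's built-in `request-success-rate` and `request-duration` checks can be supplemented with custom queries. The sketch below is illustrative rather than part of the setup above: it defines a MetricTemplate for a 5xx error percentage and shows how a Canary would reference it. The `http_requests_total` labels and the 1% ceiling are assumptions; adjust them to whatever your Prometheus actually exposes.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-percentage
  namespace: production
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  # {{ interval }} is filled in by Flagger from the metric's interval
  query: |
    100 * sum(rate(http_requests_total{app="checkout-api", status=~"5.."}[{{ interval }}]))
    /
    sum(rate(http_requests_total{app="checkout-api"}[{{ interval }}]))
---
# Excerpt only: add under spec.analysis.metrics in the Canary
- name: error-percentage
  templateRef:
    name: error-percentage
    namespace: production
  thresholdRange:
    max: 1          # assumed ceiling: fail the check above 1% errors
  interval: 2m
```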
| Factor | Argo Rollouts | Flagger |
|---|---|---|
| Replaces Deployment? | Yes (Rollout CRD) | No (wraps existing) |
| GitOps integration | Native with Argo CD | Works with any GitOps |
| UI visibility | Argo CD dashboard | Prometheus/Grafana |
| Traffic control | Fine-grained steps | Percentage-based steps |
| Mesh support | Istio, Nginx, etc | Istio, Linkerd, Nginx |
Pick Argo Rollouts if you are already running Argo CD: the integration is native and the UI shows canary progress alongside your other application state. Pick Flagger if you want to preserve existing Deployment objects and GitOps tooling.
Both tools use ingress annotations for traffic splitting with Nginx:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-api
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # 10% to canary
spec:
  rules:
    - host: checkout.internal.yourplatform.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-api-canary
                port:
                  number: 80
```

Always pair canary deployments with meaningful metrics, not just infrastructure metrics. HTTP error rate tells you the API is broken. It does not tell you that checkout completion rate dropped 2% because of a subtle UI regression. Wire your business metrics, such as `orders_completed_total` and `payment_success_rate`, into your analysis templates alongside technical metrics.
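A business-metric check can sit next to the technical ones. The following is a rough sketch only: `orders_completed_total` comes from the text above, while `orders_started_total`, the `version="canary"` label on these counters, and the 0.90 floor are hypothetical placeholders for whatever your services export.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-completion-check
  namespace: production
spec:
  metrics:
    - name: checkout-completion-rate
      interval: 5m
      count: 3                              # three measurements, then the run completes
      successCondition: result[0] >= 0.90   # assumed floor for completed / started orders
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(orders_completed_total{app="checkout-api", version="canary"}[5m]))
            /
            sum(rate(orders_started_total{app="checkout-api", version="canary"}[5m]))
```

To use it, add `- templateName: checkout-completion-check` to the `templates` list of the Rollout's analysis step.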
Start with a longer canary duration than you think you need. A 10-minute canary is fine for catching hard errors. Catching performance regressions that only appear under sustained load requires 30-60 minutes. Zerodha's trading platform, for example, needs canary analysis to cover the market-open traffic spike window — a 10-minute canary that runs at 9:05 AM IST will not see 9:15 AM traffic patterns.
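One way to cover a longer window is to stretch the pauses and run the analysis in the background for the whole rollout instead of as a single step. This is a sketch under assumptions, not a recommended configuration: the 30-minute pauses are placeholders, and the excerpt reuses the `success-rate-check` template name from earlier.

```yaml
# Excerpt of a Rollout spec: longer soak times with background analysis
spec:
  strategy:
    canary:
      analysis:
        templates:
          - templateName: success-rate-check
        startingStep: 1       # start measuring once the first setWeight has been applied
      steps:
        - setWeight: 5
        - pause:
            duration: 30m     # long enough to span the traffic window you care about
        - setWeight: 30
        - pause:
            duration: 30m
        - setWeight: 60
        - pause:
            duration: 30m
        - setWeight: 100
```

Metrics that set an `interval` but no `count` keep measuring until the rollout finishes, which is what a background analysis wants. A metric with a fixed `count` stops after that many measurements.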
Set a clear rollback condition and trust it. If your analysis template says "roll back on > 1% error rate," and the canary hits 1.1%, roll it back — even if you have a hunch it might recover. Manual overrides of automatic rollbacks erode trust in the automation and lead teams to disable it entirely.
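The "roll back on > 1% error rate" rule can be written down as an explicit failure condition, so that nobody has to interpret it at rollout time. This is a minimal sketch, assuming the same `http_requests_total` labels used in the earlier analysis template. The interval and count are illustrative.

```yaml
# Excerpt only: one entry for an AnalysisTemplate's spec.metrics list
- name: error-rate
  interval: 1m
  count: 10
  failureCondition: result[0] > 0.01   # roll back on > 1% error rate
  failureLimit: 0                      # a single failed measurement aborts the rollout
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        sum(rate(http_requests_total{app="checkout-api", version="canary", status=~"5.."}[2m]))
        /
        sum(rate(http_requests_total{app="checkout-api", version="canary"}[2m]))
```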
Add a minimum-request-count guard to your analysis so that metric evaluation is skipped when traffic volume falls below a threshold, which prevents false rollbacks from sparse data. Use a longer analysis interval during off-peak windows and cap the canary weight so that a canary is not promoted to a high traffic percentage during low-signal periods.
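One possible way to express such a guard is inside the query itself. In the sketch below, the PromQL returns an empty result when the canary sees less than one request per second, and the success condition treats an empty result as a pass. The threshold, the labels, and the choice to pass rather than pause on low traffic are all assumptions to adapt.

```yaml
# Excerpt only: one entry for an AnalysisTemplate's spec.metrics list
- name: success-rate-with-traffic-guard
  interval: 5m                # longer interval for off-peak windows
  count: 3
  # An empty result means too little traffic to judge, so it does not count as a failure
  successCondition: len(result) == 0 || result[0] >= 0.95
  failureLimit: 3
  provider:
    prometheus:
      address: http://prometheus.monitoring:9090
      query: |
        (
          sum(rate(http_requests_total{app="checkout-api", version="canary", status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="checkout-api", version="canary"}[5m]))
        )
        and on()
        (sum(rate(http_requests_total{app="checkout-api", version="canary"}[5m])) > 1)
```

Passing on an empty result means a canary can advance with no signal at all, so this guard is best combined with conservative weights during quiet periods.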
Nginx canary weight annotations are eventually consistent — the ingress controller reconciles them asynchronously. During a rapid Flagger rollback, there is a window where the annotation is updated but Nginx has not yet reloaded. Mitigate by increasing Flagger's rollback timeout and setting nginx.ingress.kubernetes.io/proxy-next-upstream to handle upstream errors gracefully during the transition window.
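The retry behaviour mentioned above is configured through annotations on the Ingress. A minimal sketch follows, with the retry conditions and the try count as assumed values:

```yaml
# Excerpt only: annotations on the Ingress that fronts the service
metadata:
  annotations:
    # Retry on connection errors, timeouts, and 502/503 from the upstream
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503"
    # Limit how many upstreams a single request may be passed to
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "2"
```

By default Nginx does not retry non-idempotent requests such as POST, so order-placement calls may still surface errors during the transition window.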