Overview and What You Will Learn
This guide covers the two most common zero-downtime deployment patterns used in production Kubernetes environments -- blue-green and canary -- and how to automate both directly inside a CI/CD pipeline. You will learn how to switch traffic between a blue and green Service, how to use Argo Rollouts to step traffic toward a canary in small percentage increments with automated Prometheus metric gates, and how to wire automated rollback so a bad deploy gets pulled back without anyone paging on-call at 2am.
Why This Matters in Production
A plain rolling update replaces pods gradually, but it still sends full production traffic to every new pod the moment that pod passes its readiness probe. A readiness probe that checks "is the process up" tells you nothing about whether the new code is actually correct under real traffic. Blue-green and canary both solve this by keeping a safety net: blue-green lets you smoke-test the entire new version before it receives any real traffic at all, and canary exposes the new version to a small, bounded slice of real traffic first, with automated analysis deciding whether to continue. Teams running payment-critical paths -- the kind of traffic Razorpay or PhonePe see -- lean on canary specifically because it bounds the blast radius of a bad deploy to a known small percentage of requests, for a known short window, before it ever reaches everyone.
**Common Mistake:** Calling a deployment "canary" because it uses two Kubernetes Deployments, but promoting from 5% to 100% with one manual click and no automated metric check in between. Without an automated gate, this is just a slower, more complicated rolling update -- the value of canary is the automated analysis deciding whether to proceed, not the percentage stepping by itself.
Core Principles
Blue-green deployment
```text
+------------------------------------------+
| BLUE (current)                           |
+------------------------------------------+
| Live now, serving 100% of traffic        |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| GREEN (new)                              |
+------------------------------------------+
| Idle, just deployed, smoke-tested first  |
+------------------------------------------+
```

- Deploy the new version (green) alongside the current version (blue) -- green receives zero production traffic initially.
- Run smoke tests directly against green's internal service address.
- Switch the production Service's selector from blue's pods to green's pods -- traffic now moves to green instantly, for every request at once.
- Rollback is just switching the selector back to blue -- close to instant, since blue's pods never stopped running.
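
The smoke-test step above needs an address that reaches only green's pods. A minimal sketch of one way to provide it, assuming a second ClusterIP Service whose name (`order-gateway-green-preview`) is hypothetical and which reuses the `app` and `version` labels of the production Service:

```yaml
# Hypothetical preview Service: selects only green pods, so smoke tests
# never touch the live blue version. The Service name is an assumption.
apiVersion: v1
kind: Service
metadata:
  name: order-gateway-green-preview
spec:
  type: ClusterIP
  selector:
    app: order-gateway
    version: green
  ports:
    - port: 80
      targetPort: 8080
```

Smoke tests can then call `http://order-gateway-green-preview` from inside the same namespace. The production `order-gateway` Service stays the only object that changes at cutover.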
```yaml
apiVersion: v1
kind: Service
metadata:
  name: order-gateway
spec:
  selector:
    app: order-gateway
    version: green   # was 'blue' before the switch
  ports:
    - port: 80
      targetPort: 8080
```

**Security:** Blue-green keeps the old version's pods running and fully provisioned during the cutover window -- make sure both versions' pods are covered by the same network policies and secret access, not just whichever version currently has the live Service selector.
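
One way to keep policy coverage identical for both colours is to select on the shared `app` label and leave `version` out of the selector. A sketch, assuming a hypothetical policy name and an allowed client labelled `role: ingress-gateway`, neither of which comes from the setup described here:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: order-gateway-ingress        # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: order-gateway             # no 'version' key: matches blue and green alike
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: ingress-gateway  # assumed label on the allowed client pods
      ports:
        - protocol: TCP
          port: 8080
```

Because the `from` clause has no `namespaceSelector`, it only admits client pods in the same namespace as the policy. A policy keyed on `version: blue` would silently stop protecting the live pods the moment the Service selector flips to green.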
Canary deployment with Argo Rollouts
```text
+------------------------------------------+
| Deploy canary: 5 percent of traffic      |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Prometheus checks error rate for 5 min   |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Error rate ok: promote to 25 percent     |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Repeat checks at 50, then 100 percent    |
+------------------------------------------+
                     |
                     v
+------------------------------------------+
| Error rate bad: auto rollback to stable  |
+------------------------------------------+
```

Argo Rollouts replaces a standard Deployment with a Rollout resource that understands traffic-weighted steps and can pause automatically between them for analysis.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-gateway
spec:
  replicas: 10
  # selector and pod template omitted here; they carry over from the Deployment
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
```

Automated promotion gates with Prometheus
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5   # stop after 5 measurements; with an interval but no count the metric runs indefinitely
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="order-gateway",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="order-gateway"}[5m]))
```

If the error rate exceeds 1% on any measurement during an analysis window, Argo Rollouts automatically aborts the rollout and reverts traffic weight back to the stable version -- no human in the loop is required for the common case.
**Tip:** Run the canary's analysis query against the canary pods specifically (label-scoped), not the whole service. A canary running at 5% weight whose own error rate spikes can still look fine in an aggregate metric dominated by the 95% of traffic still hitting the stable version.
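
A sketch of what a canary-scoped gate could look like. Argo Rollouts can pass the canary ReplicaSet's pod-template hash into an analysis as an argument. The template name `canary-error-rate-check` is hypothetical, and the Prometheus label `rollouts_pod_template_hash` assumes your scrape config copies the pod label `rollouts-pod-template-hash` onto each series, which depends on your relabelling rules:

```yaml
# Fragment of the Rollout's canary steps: hand the canary's hash to the analysis.
- analysis:
    templates:
      - templateName: canary-error-rate-check   # hypothetical template name
    args:
      - name: canary-hash
        valueFrom:
          podTemplateHashValue: Latest          # Latest = canary, Stable = stable
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-error-rate-check
spec:
  args:
    - name: canary-hash
  metrics:
    - name: canary-error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Label name below is an assumption about how pod labels reach Prometheus.
          query: |
            sum(rate(http_requests_total{job="order-gateway",rollouts_pod_template_hash="{{args.canary-hash}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="order-gateway",rollouts_pod_template_hash="{{args.canary-hash}}"}[5m]))
```

At low weights the canary may serve very few requests per window, so it is worth checking that the query returns data at all before trusting a passing result.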
Istio traffic weighting
Without a traffic router, Argo Rollouts approximates each weight by the ratio of canary pods to stable pods, so a 10-replica Rollout cannot split traffic more finely than about 10%. For precise percentage-based splitting below that, Argo Rollouts integrates with Istio's VirtualService to set exact traffic weights independent of how many pods are actually running on each side.
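
A minimal sketch of that wiring. The Service names `order-gateway-stable` and `order-gateway-canary`, the route name `primary`, and the VirtualService API version are assumptions for illustration; both Services must already exist and select the Rollout's pods:

```yaml
# Fragment of the Rollout spec: point the canary strategy at an Istio VirtualService.
strategy:
  canary:
    stableService: order-gateway-stable   # assumed Service name
    canaryService: order-gateway-canary   # assumed Service name
    trafficRouting:
      istio:
        virtualService:
          name: order-gateway
          routes:
            - primary                     # must match the http route name below
---
apiVersion: networking.istio.io/v1beta1   # adjust to the version your mesh serves
kind: VirtualService
metadata:
  name: order-gateway
spec:
  hosts:
    - order-gateway
  http:
    - name: primary
      route:
        - destination:
            host: order-gateway-stable
          weight: 100                     # the controller rewrites these two weights
        - destination:
            host: order-gateway-canary
          weight: 0
```

With this in place, a `setWeight: 5` step is applied by editing the two route weights to 95 and 5 rather than by adjusting pod counts.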
Feature flags as an alternative
A feature flag service (LaunchDarkly-style, or a simple in-house flag table) lets you canary a code path to a percentage of users without any deployment-level traffic splitting at all -- useful when the risk lives in application logic rather than infrastructure behaviour, and faster to roll back since flipping a flag needs no new deployment.
Detailed Step-by-Step Practical Lab
This lab implements an automated canary rollout for CRED's order-gateway service using Argo Rollouts with a Prometheus error-rate gate.
Milestone 1 — Install the Argo Rollouts controller
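```shell
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
kubectl argo rollouts version
```

At this point the controller is running but no Rollout resources exist yet. The last command assumes the Argo Rollouts kubectl plugin is already installed on your machine; the controller manifest does not provide it.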
Milestone 2 — Convert the existing Deployment to a Rollout
Change `apiVersion: apps/v1` to `apiVersion: argoproj.io/v1alpha1`, replace `kind: Deployment` with `kind: Rollout`, and add the canary strategy block shown in Core Principles, keeping the same pod template.
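
A sketch of what the converted `order-gateway-rollout.yaml` might look like. The `app: order-gateway` label, container name, and port are inferred from the Service and CI step elsewhere in this guide; `<tag>` is a placeholder for whatever image tag is currently deployed:

```yaml
apiVersion: argoproj.io/v1alpha1   # was apps/v1
kind: Rollout                      # was Deployment
metadata:
  name: order-gateway
spec:
  replicas: 10
  selector:
    matchLabels:
      app: order-gateway
  template:                        # pod template carried over unchanged
    metadata:
      labels:
        app: order-gateway
    spec:
      containers:
        - name: order-gateway
          image: 418773912004.dkr.ecr.ap-south-1.amazonaws.com/order-gateway:<tag>
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        # remaining steps as in the Core Principles example
```

The original Deployment should be removed or scaled to zero once the Rollout is healthy, otherwise both controllers run pods with the same labels behind the same Service.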
```shell
kubectl apply -f order-gateway-rollout.yaml
kubectl argo rollouts get rollout order-gateway --watch
```

At this point a new image push will trigger a canary rollout instead of a standard rolling update.
Milestone 3 — Define the AnalysisTemplate
```shell
kubectl apply -f error-rate-analysistemplate.yaml
```

At this point the analysis template exists in the cluster but has not been referenced by a live rollout step yet.
Milestone 4 — Trigger a canary rollout from CI
```yaml
- name: Update image and trigger rollout
  run: |
    kubectl argo rollouts set image order-gateway \
      order-gateway=418773912004.dkr.ecr.ap-south-1.amazonaws.com/order-gateway:${{ github.sha }}
```

At this point `kubectl argo rollouts get rollout order-gateway --watch` should show weight stepping to 5%, then pausing for analysis.
Milestone 5 — Verify automated rollback on a failing analysis
Deploy a version that deliberately returns 500s on 5% of requests and confirm the rollout aborts on its own.
```shell
kubectl argo rollouts get rollout order-gateway
# Status should show "Degraded" and weight reverted to 0
```

At this point you have confirmed the safety net actually works before trusting it with a real production deploy.
Milestone 6 — Promote a healthy canary through to 100 percent
```shell
kubectl argo rollouts promote order-gateway
kubectl argo rollouts get rollout order-gateway --watch
```

At this point traffic weight should step through the remaining stages automatically until the new version is serving 100% of traffic.
**Remember:** A `pause: {}` step with no duration pauses indefinitely until a human runs `kubectl argo rollouts promote` -- decide deliberately which steps should be fully automated and which should wait for a human, rather than leaving all steps on indefinite manual pause by default.
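
The two kinds of pause can be mixed in one strategy. The sketch below is illustrative only; which step waits for a human is an assumption, not a recommendation from this guide:

```yaml
# Fragment of a Rollout spec mixing timed and manual pauses.
strategy:
  canary:
    steps:
      - setWeight: 5
      - pause: { duration: 5m }   # timed: resumes on its own after 5 minutes
      - setWeight: 25
      - pause: {}                 # indefinite: waits for 'kubectl argo rollouts promote'
      - setWeight: 100
```

Here the first step proceeds unattended, while the jump from 25% to 100% is held until someone explicitly promotes it.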
Production Best Practices & Common Pitfalls
- Scope analysis queries to the canary's pods specifically, not aggregate service-wide metrics that can mask a canary-only regression.
- Keep the first canary step small (5% or less) -- the whole point is bounding blast radius before you know the new version is safe.
- Make sure database migrations applied alongside a canary are backward compatible with the stable version still running, since both versions run against the same database during the rollout window.
- For blue-green, do not delete blue's pods immediately after cutover -- keep them warm for a defined rollback window before scaling to zero.
- Alert on a rollout stuck in "Paused" for longer than expected -- this usually means a manual promotion step was forgotten, not that everything is fine.
Quick Reference & Troubleshooting Commands
| Symptom | Command | What to Look For |
|---|---|---|
| Rollout stuck at a weight step | `kubectl argo rollouts get rollout order-gateway` | A manual pause step awaiting promote, not a failed analysis |
| Analysis always fails immediately | `kubectl describe analysisrun <name>` | Prometheus query returning no data, often a wrong label selector |
| Traffic not actually split as expected | `kubectl get virtualservice order-gateway -o yaml` | Istio VirtualService weights not synced with the Rollout step |
| Blue-green switch did not move traffic | `kubectl get svc order-gateway -o yaml` | Service selector still pointing at the old version label |