Blue-Green and Canary Deployments in CI/CD Pipelines | DevOps Network

Overview and What You Will Learn

This guide covers the two most common zero-downtime deployment patterns used in production Kubernetes environments -- blue-green and canary -- and how to automate both directly inside a CI/CD pipeline. You will learn how to switch a Service's traffic between blue and green pods, how to use Argo Rollouts to step traffic toward a canary in small percentage increments with automated Prometheus metric gates, and how to wire automated rollback so a bad deploy gets pulled back without anyone paging on-call at 2am.

Why This Matters in Production

A plain rolling update replaces pods gradually but still sends full production traffic to every new pod the moment it passes its readiness probe -- a readiness probe checking "is the process up" tells you nothing about whether the new code is actually correct under real traffic. Blue-green and canary both solve this by keeping a safety net: blue-green lets you smoke-test the entire new version before it receives any real traffic at all, and canary exposes the new version to a small, bounded slice of real traffic first, with automated analysis deciding whether to continue. Teams running payment-critical paths -- the kind of traffic Razorpay or PhonePe see -- lean on canary specifically because it bounds the blast radius of a bad deploy to a known small percentage of requests, for a known short window, before it ever reaches everyone.

**Common Mistake:** Calling a deployment "canary" because it uses two Kubernetes Deployments, but promoting from 5% to 100% with one manual click and no automated metric check in between. Without an automated gate, this is just a slower, more complicated rolling update -- the value of canary is the automated analysis deciding whether to proceed, not the percentage stepping by itself.

Core Principles

Blue-green deployment

◈ DIAGRAM

+------------------------------------------+
|            BLUE (current)                |
+------------------------------------------+
| Live now, serving 100% of traffic        |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
|            GREEN (new)                   |
+------------------------------------------+
| Idle, just deployed, smoke-tested first  |
+------------------------------------------+

Deploy the new version (green) alongside the current version (blue) -- green receives zero production traffic initially.
Run smoke tests directly against green's internal service address.
Switch the production Service's selector from blue's pods to green's pods -- new connections now go to green almost immediately, all at once rather than gradually (connections already open to blue's pods continue until they close).
Rollback is just switching the selector back to blue -- close to instant, since blue's pods never stopped running.

YAML

apiVersion: v1
kind: Service
metadata:
  name: order-gateway
spec:
  selector:
    app: order-gateway
    version: green   # was 'blue' before the switch
  ports:
    - port: 80
      targetPort: 8080
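
The Service above only routes correctly if the pods carry matching labels. A minimal sketch of the green side follows. The Deployment name `order-gateway-green`, the preview Service name `order-gateway-green-preview`, the replica count, and the `NEW_TAG` image tag are assumptions made for illustration and do not come from the original setup.

```yaml
# Green Deployment: same app label as blue, different version label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-gateway-green            # hypothetical name
spec:
  replicas: 3                          # assumed; size green to match blue before cutover
  selector:
    matchLabels:
      app: order-gateway
      version: green
  template:
    metadata:
      labels:
        app: order-gateway
        version: green
    spec:
      containers:
        - name: order-gateway
          image: 418773912004.dkr.ecr.ap-south-1.amazonaws.com/order-gateway:NEW_TAG  # placeholder tag
          ports:
            - containerPort: 8080
---
# Preview Service: cluster-internal address for smoke tests,
# usable before the production Service selector is switched.
apiVersion: v1
kind: Service
metadata:
  name: order-gateway-green-preview    # hypothetical name
spec:
  selector:
    app: order-gateway
    version: green
  ports:
    - port: 80
      targetPort: 8080
```

With something like this in place, smoke tests would target the preview Service from inside the cluster, and the cutover itself stays a one-line change to the production Service's `version` selector.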

**Security:** Blue-green keeps the old version's pods running and fully provisioned during the cutover window -- make sure both versions' pods are covered by the same network policies and secret access, not just whichever version currently has the live Service selector.

Canary deployment with Argo Rollouts

◈ DIAGRAM

+------------------------------------------+
| Deploy canary: 5 percent of traffic      |
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Prometheus checks error rate for 5 min   |
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Error rate ok: promote to 25 percent     |
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Repeat checks at 50, then 100 percent    |
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Error rate bad: auto rollback to stable  |
+------------------------------------------+

Argo Rollouts replaces a standard Deployment with a Rollout resource that understands traffic-weighted steps and can pause automatically between them for analysis.

YAML

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-gateway
spec:
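  replicas: 10
  # selector and template are omitted here for brevity; both are required
  # and are identical to the ones in the Deployment being replaced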
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 50
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 100

Automated promotion gates with Prometheus

YAML

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
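      interval: 1m
      count: 5   # with interval but no count, the metric runs indefinitely and an inline analysis step never completes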
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="order-gateway",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="order-gateway"}[5m]))

If error-rate ever exceeds 1% during an analysis window, Argo Rollouts automatically aborts the rollout and reverts traffic weight back to the stable version -- no human in the loop is required for the common case.

**Tip:** Run the canary's analysis query against the canary pods specifically (label-scoped), not the whole service. A canary running at 5% weight whose own error rate spikes can still look fine in an aggregate metric dominated by the 95% of traffic still hitting the stable version.
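
One way to scope the query is sketched below, under stated assumptions. Argo Rollouts labels the pods of each ReplicaSet with `rollouts-pod-template-hash`, and an analysis step can pass the newest hash into the template as an argument. The metric label `rollouts_pod_template_hash` assumes Prometheus is configured to copy pod labels onto scraped series, which the setup described here does not confirm. The template name `canary-error-rate-check` and the argument name `canary-hash` are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-error-rate-check        # hypothetical name
spec:
  args:
    - name: canary-hash                # supplied by the Rollout step shown below
  metrics:
    - name: canary-error-rate
      interval: 1m
      count: 5                         # five measurements, then the run completes
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Label name below is an assumption about the Prometheus relabeling config.
          query: |
            sum(rate(http_requests_total{job="order-gateway",rollouts_pod_template_hash="{{args.canary-hash}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="order-gateway",rollouts_pod_template_hash="{{args.canary-hash}}"}[5m]))
```

The matching analysis step in the Rollout would pass the canary's hash like this:

```yaml
# Fragment of the Rollout's canary steps, not a complete manifest.
steps:
  - setWeight: 5
  - pause: { duration: 5m }
  - analysis:
      templates:
        - templateName: canary-error-rate-check
      args:
        - name: canary-hash
          valueFrom:
            podTemplateHashValue: Latest   # hash of the newest (canary) ReplicaSet
```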

Istio traffic weighting

For precise percentage-based splitting below what Kubernetes Service load-balancing can reliably guarantee at low pod counts, Argo Rollouts integrates with Istio's VirtualService to set exact traffic weights independent of how many pods are actually running on each side.
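
A rough sketch of that wiring follows. The Service names `order-gateway-stable` and `order-gateway-canary`, the route name `primary`, and the host `order-gateway` are assumptions chosen for illustration. The Rollout fragment tells the controller which VirtualService route it is allowed to rewrite:

```yaml
# Fragment of the Rollout spec, not a complete manifest.
strategy:
  canary:
    stableService: order-gateway-stable    # hypothetical Service name
    canaryService: order-gateway-canary    # hypothetical Service name
    trafficRouting:
      istio:
        virtualService:
          name: order-gateway
          routes:
            - primary                      # must match the http route name in the VirtualService
```

The VirtualService starts with all weight on stable, and the controller adjusts the two weights at each `setWeight` step:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-gateway
spec:
  hosts:
    - order-gateway                        # assumed in-mesh host name
  http:
    - name: primary
      route:
        - destination:
            host: order-gateway-stable
          weight: 100
        - destination:
            host: order-gateway-canary
          weight: 0
```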

Feature flags as an alternative

A feature flag service (LaunchDarkly-style, or a simple in-house flag table) lets you canary a code path to a percentage of users without any deployment-level traffic splitting at all -- useful when the risk lives in application logic rather than infrastructure behaviour, and faster to roll back since flipping a flag needs no new deployment.

Detailed Step-by-Step Practical Lab

This lab implements an automated canary rollout for CRED's order-gateway service using Argo Rollouts with a Prometheus error-rate gate.

Milestone 1 — Install the Argo Rollouts controller

Bash

kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
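# The kubectl plugin is a separate install from the controller (Linux amd64 shown)
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
sudo install -m 755 kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
kubectl argo rollouts version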

At this point the controller is running but no Rollout resources exist yet.

Milestone 2 — Convert the existing Deployment to a Rollout

Replace kind: Deployment with kind: Rollout and add the canary strategy block shown in Core Principles, keeping the same pod template.
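
As a sketch, the converted `order-gateway-rollout.yaml` might look like the following. The pod label `app: order-gateway`, the container port, and the `CURRENT_TAG` image tag are assumptions inferred from the Service and CI examples in this guide; the real values come from the existing Deployment.

```yaml
apiVersion: argoproj.io/v1alpha1           # was: apps/v1
kind: Rollout                              # was: Deployment
metadata:
  name: order-gateway
spec:
  replicas: 10
  selector:                                # carried over unchanged from the Deployment
    matchLabels:
      app: order-gateway
  template:                                # carried over unchanged from the Deployment
    metadata:
      labels:
        app: order-gateway
    spec:
      containers:
        - name: order-gateway
          image: 418773912004.dkr.ecr.ap-south-1.amazonaws.com/order-gateway:CURRENT_TAG  # placeholder tag
          ports:
            - containerPort: 8080
  strategy:                                # replaces the Deployment's rolling update strategy
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-check
        # remaining steps as shown in Core Principles
```

Once the Rollout's pods are healthy, the original Deployment should be deleted or scaled to zero, otherwise two controllers keep pods with the same labels running side by side.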

Bash

kubectl apply -f order-gateway-rollout.yaml
kubectl argo rollouts get rollout order-gateway --watch

At this point a new image push will trigger a canary rollout instead of a standard rolling update.

Milestone 3 — Define the AnalysisTemplate

Bash

kubectl apply -f error-rate-analysistemplate.yaml

At this point the analysis template exists in the cluster but has not been referenced by a live rollout step yet.

Milestone 4 — Trigger a canary rollout from CI

YAML

- name: Update image and trigger rollout
  run: |
    kubectl argo rollouts set image order-gateway \
      order-gateway=418773912004.dkr.ecr.ap-south-1.amazonaws.com/order-gateway:${{ github.sha }}

At this point kubectl argo rollouts get rollout order-gateway --watch should show weight stepping to 5%, then pausing for analysis.
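
For context, a fuller workflow around that step could look roughly like this sketch. It assumes GitHub Actions on an Ubuntu runner, an EKS cluster in `ap-south-1`, OIDC-based role assumption, and an image that an earlier job has already pushed. The workflow name, the `DEPLOY_ROLE_ARN` secret, and the `CLUSTER_NAME` variable are hypothetical.

```yaml
name: deploy-order-gateway                 # hypothetical workflow name
on:
  push:
    branches: [main]
permissions:
  id-token: write                          # required for OIDC role assumption
  contents: read
jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DEPLOY_ROLE_ARN }}   # hypothetical secret
          aws-region: ap-south-1
      - name: Point kubectl at the cluster
        run: aws eks update-kubeconfig --name "${{ vars.CLUSTER_NAME }}" --region ap-south-1
      - name: Install the Argo Rollouts kubectl plugin
        run: |
          curl -sSLo kubectl-argo-rollouts \
            https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
          sudo install -m 755 kubectl-argo-rollouts /usr/local/bin/kubectl-argo-rollouts
      - name: Update image and trigger rollout
        run: |
          kubectl argo rollouts set image order-gateway \
            order-gateway=418773912004.dkr.ecr.ap-south-1.amazonaws.com/order-gateway:${{ github.sha }}
      - name: Wait for the rollout to finish or abort
        # Blocks until the rollout is healthy; an aborted rollout should fail the job.
        run: kubectl argo rollouts status order-gateway --timeout 45m
```

The final step is a design choice: without it the pipeline reports success as soon as the image is set, long before the analysis gates have passed or failed.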

Milestone 5 — Verify automated rollback on a failing analysis

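Deploy a version that deliberately returns 500s and confirm the rollout aborts on its own. Choose the failure rate with the analysis query in mind: with the service-wide query shown earlier, and assuming an exact 5/95 traffic split with an error-free stable version, a canary at 5% weight has to fail more than 20% of its own requests to push the aggregate error rate past 1%. A version that fails only 5% of its requests will pass the first gate unless the query is scoped to the canary pods.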

Bash

kubectl argo rollouts get rollout order-gateway
# Status should show "Degraded" and weight reverted to 0

At this point you have confirmed the safety net actually works before trusting it with a real production deploy.

Milestone 6 — Promote a healthy canary through to 100 percent

Bash

kubectl argo rollouts promote order-gateway
kubectl argo rollouts get rollout order-gateway --watch

At this point traffic weight should step through the remaining stages automatically until the new version is serving 100% of traffic.

**Remember:** A `pause: {}` step with no duration pauses indefinitely until a human runs `kubectl argo rollouts promote` -- decide deliberately which steps should be fully automated and which should wait for a human, rather than leaving all steps on indefinite manual pause by default.
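
A short sketch of mixing the two kinds of pause in one rollout. The choice of which step waits for a human is illustrative only:

```yaml
# Fragment of a Rollout's canary steps, not a complete manifest.
steps:
  - setWeight: 5
  - pause: { duration: 5m }      # timed: continues on its own after 5 minutes
  - analysis:
      templates:
        - templateName: error-rate-check
  - setWeight: 50
  - pause: {}                    # indefinite: waits for `kubectl argo rollouts promote`
  - setWeight: 100
```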

Production Best Practices & Common Pitfalls

Scope analysis queries to the canary's pods specifically, not aggregate service-wide metrics that can mask a canary-only regression.
Keep the first canary step small (5% or less) -- the whole point is bounding blast radius before you know the new version is safe.
Make sure database migrations applied alongside a canary are backward compatible with the stable version still running, since both versions run against the same database during the rollout window.
For blue-green, do not delete blue's pods immediately after cutover -- keep them warm for a defined rollback window before scaling to zero.
Alert on a rollout stuck in "Paused" for longer than expected -- this usually means a manual promotion step was forgotten, not that everything is fine.

Quick Reference & Troubleshooting Commands

| Symptom | Command | What to Look For |
| --- | --- | --- |
| Rollout stuck at a weight step | `kubectl argo rollouts get rollout order-gateway` | A manual pause step awaiting promote, not a failed analysis |
| Analysis always fails immediately | `kubectl describe analysisrun <name>` | Prometheus query returning no data, often a wrong label selector |
| Traffic not actually split as expected | `kubectl get virtualservice order-gateway -o yaml` | Istio VirtualService weights not synced with the Rollout step |
| Blue-green switch did not move traffic | `kubectl get svc order-gateway -o yaml` | Service selector still pointing at the old version label |
