What is Canary Deployment? | DevOps Dictionary

Canary Deployment — Test Your Release With Real Users Safely

What is Canary Deployment in Simple Terms?

A canary deployment lets you test a new version of your application on real production traffic — but only for a small, controlled percentage of users. If the new version has a bug, only 5-10% of users are affected instead of everyone. You catch the problem, scale the canary back to zero, and only a fraction of your users ever noticed.

The name comes from coal miners who carried canary birds into mines to detect dangerous gases. If the canary died, the miners knew to evacuate before the gas reached them. In software, the canary pods detect problems before they affect the full user base.

Hotstar uses canary deployments for streaming API changes during off-peak hours — routing 5% of traffic to the new version while 95% stays on stable. If streaming quality metrics drop for the canary users, the canary is pulled back before it affects anyone else.

How Traffic Splitting Works With Replica Counts

The simplest canary strategy in Kubernetes uses replica ratios. The Service routes traffic roughly evenly across all pods with the matching label — so 1 canary pod out of 10 total means 10% of traffic hits the canary:

◈ DIAGRAM

| 9 stable pods (v1.9.0)  +  1 canary pod (v2.0.0)  = 10 pods total
| Service routes to all 10 pods (matched by app: notification-api)
|
| Approximate traffic split:
| 90% -> stable pods  (9 out of 10)
| 10% -> canary pod   (1 out of 10)
 
+-----------------------+   +----------------------+
| notification-stable   |   | notification-canary  |
| replicas: 9 (v1.9.0)  |   | replicas: 1 (v2.0.0) |
| track: stable         |   | track: canary        |
+-----------------------+   +----------------------+
            |                           |
            +---------------------------+
                           |
              +-------------------------+
              | Service                 |
              | selector:               |
              |   app: notification-api |
              | (matches BOTH)          |
              +-------------------------+
                           |
                    Internet traffic
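
The arithmetic behind the diagram can be sketched in a few lines of Python. This is a minimal illustration that assumes the Service spreads requests evenly across all ready pods; `canary_share` is a hypothetical helper name, not part of Kubernetes.

```python
def canary_share(stable_replicas: int, canary_replicas: int) -> float:
    """Expected fraction of requests reaching canary pods.

    Assumes the Service spreads requests evenly over every ready pod.
    """
    total = stable_replicas + canary_replicas
    if total <= 0:
        raise ValueError("at least one replica is required")
    return canary_replicas / total


# Replica combinations used later in the release workflow.
for stable, canary in [(9, 1), (7, 3), (5, 5), (0, 10)]:
    share = canary_share(stable, canary)
    print(f"{stable} stable + {canary} canary -> {share:.0%} canary traffic")
```

The result is an expectation, not a guarantee: the real split depends on how connections are distributed at any given moment.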

Setting Up a Canary Deployment

YAML

# stable-deployment.yaml — current production version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notification-api-stable
  namespace: production
spec:
  replicas: 9
  selector:
    matchLabels:
      app: notification-api
      track: stable
  template:
    metadata:
      labels:
        app: notification-api
        track: stable
        version: v1.9.0
    spec:
      containers:
        - name: notification-api
          image: registry.hotstar.in/notification-api:v1.9.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
---
# canary-deployment.yaml — new version under test
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notification-api-canary
  namespace: production
spec:
  replicas: 1             # Start with just 1 out of 10 pods = 10% traffic
  selector:
    matchLabels:
      app: notification-api
      track: canary
  template:
    metadata:
      labels:
        app: notification-api
        track: canary
        version: v2.0.0
    spec:
      containers:
        - name: notification-api
          image: registry.hotstar.in/notification-api:v2.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
---
# service.yaml — routes to BOTH deployments via shared app label
apiVersion: v1
kind: Service
metadata:
  name: notification-api
  namespace: production
spec:
  selector:
    app: notification-api    # Matches BOTH stable and canary pods
  ports:
    - name: http
      port: 80
      targetPort: 8080

The Complete Canary Release Workflow

Bash

# Step 1 — Deploy canary at 10% (1 canary, 9 stable)
kubectl apply -f canary-deployment.yaml
kubectl rollout status deployment/notification-api-canary -n production
 
# Step 2 — Verify canary pod is running and receiving traffic
kubectl get pods -n production -l track=canary
# NAME                                    READY   STATUS
# notification-api-canary-7d9f8c-xkp2q   1/1     Running
 
# Step 3 — Monitor canary for 15-30 minutes
# Compare error rates between stable and canary:
# Prometheus query:
# sum by (track) (
#   rate(http_requests_total{status=~"5..", namespace="production"}[5m])
# ) /
# sum by (track) (
#   rate(http_requests_total{namespace="production"}[5m])
# ) * 100
 
# Watch canary logs for errors
kubectl logs -f -l track=canary -n production
 
# Check canary pod resource usage
kubectl top pods -n production -l track=canary
 
# Step 4 — Gradually increase canary traffic if healthy
# Increase to 30%: 3 canary, 7 stable
kubectl scale deployment notification-api-canary --replicas=3 -n production
kubectl scale deployment notification-api-stable --replicas=7 -n production
 
# Wait 15 minutes, monitor again
 
# Increase to 50%: 5 canary, 5 stable
kubectl scale deployment notification-api-canary --replicas=5 -n production
kubectl scale deployment notification-api-stable --replicas=5 -n production
 
# Increase to 100%: 10 canary, 0 stable
kubectl scale deployment notification-api-canary --replicas=10 -n production
kubectl scale deployment notification-api-stable --replicas=0 -n production
 
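# Step 5 - Clean up: delete the old stable Deployment.
# A Deployment cannot be renamed in place, so notification-api-canary keeps
# serving v2.0.0. Before the next release, re-create notification-api-stable
# with the v2.0.0 image and scale the canary back down.
kubectl delete deployment notification-api-stable -n production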
 
# Step 6 - EMERGENCY ROLLBACK (possible at any stage before the Step 5 cleanup)
# Remove all canary traffic and restore full stable capacity
kubectl scale deployment notification-api-canary --replicas=0 -n production
kubectl scale deployment notification-api-stable --replicas=10 -n production
# Canary pods leave the Service endpoints within seconds; newly created stable
# pods start taking traffic once their readiness probes pass
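
The replica counts in Step 4 follow from a fixed pool of 10 pods. The sketch below derives them for each stage; `replica_plan` is a hypothetical helper, and the choice to keep at least one canary pod for any non-zero target is an assumption rather than something the workflow states.

```python
from typing import Tuple


def replica_plan(total_pods: int, canary_percent: int) -> Tuple[int, int]:
    """Return (canary_replicas, stable_replicas) for a target canary percentage."""
    if total_pods <= 0:
        raise ValueError("total_pods must be positive")
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    canary = round(total_pods * canary_percent / 100)
    # Assumption: a non-zero target always gets at least one canary pod,
    # even if that overshoots the requested percentage.
    if canary_percent > 0:
        canary = max(canary, 1)
    return canary, total_pods - canary


# Stages from the workflow, with the 10-pod total used there.
for percent in (10, 30, 50, 100):
    canary, stable = replica_plan(10, percent)
    print(f"{percent}%: canary replicas={canary}, stable replicas={stable}")
```

With only 10 pods, targets finer than 10% cannot be represented: a 5% target still yields one canary pod, which is roughly 10% of traffic.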

What to Monitor During a Canary Release

PromQL

# The four key metrics to watch for the canary version:
 
# 1. Error rate (most important)
# Any increase in 5xx errors on canary pods signals a problem
sum by (track) (
  rate(http_requests_total{status=~"5..", namespace="production"}[2m])
) / sum by (track) (
  rate(http_requests_total{namespace="production"}[2m])
) * 100
# Alert threshold: canary error rate > 2x stable error rate
 
# 2. Response latency
# Canary pods running slower signals performance regression
histogram_quantile(0.99,
  sum by (track, le) (
    rate(http_request_duration_seconds_bucket{namespace="production"}[2m])
  )
)
# Alert threshold: canary P99 latency > 1.5x stable P99 latency
 
# 3. Pod restart count
# Canary pods crashing signals a startup or runtime error
increase(kube_pod_container_status_restarts_total{
  namespace="production",
  pod=~"notification-api-canary-.*"
}[5m]) > 0
 
# 4. Memory usage
# Memory growing faster than stable suggests a memory leak
container_memory_working_set_bytes{
  namespace="production",
  pod=~"notification-api-canary-.*"
}
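
Once the query results are in hand, the alert thresholds can be applied in code. The following is a minimal sketch that assumes the values have already been fetched from Prometheus; `TrackMetrics` and `canary_problems` are hypothetical names. Memory usage is left out because no numeric threshold is given for it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackMetrics:
    error_rate_percent: float   # 5xx responses as a percentage of all requests
    p99_latency_seconds: float  # P99 request latency
    restarts: int = 0           # container restarts in the observation window


def canary_problems(stable: TrackMetrics, canary: TrackMetrics,
                    error_ratio: float = 2.0,
                    latency_ratio: float = 1.5) -> List[str]:
    """List the signals on which the canary breaches a threshold."""
    problems = []
    # Note: if stable has a 0% error rate, any canary error breaches this check.
    if canary.error_rate_percent > error_ratio * stable.error_rate_percent:
        problems.append("error rate")
    if canary.p99_latency_seconds > latency_ratio * stable.p99_latency_seconds:
        problems.append("p99 latency")
    if canary.restarts > 0:
        problems.append("restarts")
    return problems


stable = TrackMetrics(error_rate_percent=0.5, p99_latency_seconds=0.200)
canary = TrackMetrics(error_rate_percent=1.5, p99_latency_seconds=0.400, restarts=1)
print(canary_problems(stable, canary))
```

An empty list means none of the three thresholds was breached. It does not prove the canary is healthy.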

Canary vs Blue-Green vs Rolling Update

| Strategy | Traffic Control | Rollback Speed | Cost | Best For |
| --- | --- | --- | --- | --- |
| Rolling Update | Gradual pod replacement | 2-5 minutes | Normal | Standard feature releases |
| Canary | Percentage-based (approximate with replica counts) | Instant (scale to 0) | +10% pods | Risky changes, new features |
| Blue-Green | All-or-nothing switch | Instant (service patch) | +100% pods | Payment flows, schema changes |

**Remember:** Always monitor the canary for at least 15 minutes at each traffic percentage before increasing it. Error rate, P99 latency, pod restarts, and memory usage are the four signals that matter. A slow problem such as a memory leak can look healthy for the first 5 minutes at 10% and only become visible after 20 minutes at 30%, so holding at each stage gives you time to catch it before full rollout.

**Tip:** Automate the canary promotion with a script that checks error rates automatically. If the canary error rate stays below 1% for 15 minutes at the current traffic percentage, automatically increase to the next level. If the error rate exceeds 2%, automatically scale the canary to zero and alert the team. This removes the human decision point from the critical path and makes canary releases safe enough to run during business hours.
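
The decision rule in this tip can be sketched as a small pure function. This is a minimal illustration, not a complete automation: fetching metrics and calling `kubectl scale` are left out, and `decide` and `next_stage` are hypothetical names. Holding at the current level when the error rate sits between 1% and 2% is an assumption, since the tip does not cover that range.

```python
from typing import List

STAGES = [10, 30, 50, 100]  # canary traffic percentages from the workflow


def decide(error_rates_percent: List[float],
           promote_below: float = 1.0,
           rollback_above: float = 2.0) -> str:
    """Return 'promote', 'hold', or 'rollback' for one observation window.

    error_rates_percent holds the canary error-rate samples collected over
    the window (for example, one sample per minute for 15 minutes).
    """
    if not error_rates_percent:
        return "hold"  # no data yet, so do not promote blindly
    worst = max(error_rates_percent)
    if worst > rollback_above:
        return "rollback"
    if worst < promote_below:
        return "promote"
    return "hold"  # assumption: between the two thresholds, stay put


def next_stage(current_percent: int, decision: str) -> int:
    """Map a decision to the next canary traffic percentage."""
    if decision == "rollback":
        return 0
    if decision == "promote":
        later = [s for s in STAGES if s > current_percent]
        return later[0] if later else current_percent
    return current_percent


window = [0.2, 0.4, 0.1]
print(decide(window), next_stage(10, decide(window)))
```

Keeping the decision separate from the side effects makes the rule easy to test before it is trusted with production traffic.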

**Common Mistake:** Using a canary with only 1 pod at very low traffic. A single canary pod at 5% traffic may not receive enough requests to produce statistically meaningful error rates. If your service handles 10 requests per second in total, the canary pod gets 0.5 requests per second, or about 30 per minute, so a single failed request moves the one-minute error rate by more than 3 percentage points. You need to wait much longer before the error rate is statistically reliable. Start the canary at a minimum of 10%, and confirm it is receiving at least 10 requests per minute, as a bare floor, before trusting the error rate metrics.
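
The waiting time implied by low traffic is simple to estimate. The sketch below reuses the 10 requests per second figure from the example. The target of 1,000 canary requests is a hypothetical sample size chosen for illustration, not a recommendation from this article.

```python
def canary_request_rate(total_rps: float, canary_share: float) -> float:
    """Requests per second expected to reach the canary."""
    return total_rps * canary_share


def seconds_to_collect(total_rps: float, canary_share: float,
                       min_requests: int) -> float:
    """Time needed for the canary to see min_requests requests."""
    rate = canary_request_rate(total_rps, canary_share)
    if rate <= 0:
        raise ValueError("canary receives no traffic")
    return min_requests / rate


for share in (0.05, 0.10):
    rate = canary_request_rate(10, share)
    wait = seconds_to_collect(10, share, min_requests=1000)
    print(f"{share:.0%} canary: {rate:.1f} req/s, "
          f"{wait / 60:.0f} min to see 1000 requests")
```

Doubling the canary share halves the wait for the same sample size, which is one reason a very small canary needs a longer observation window.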

**Warning:** In Kubernetes, canary traffic splitting by replica count is only approximate. In its default iptables mode, kube-proxy picks a backend pod at random for each new connection, so the split matches the replica ratio on average but is not enforced per request. At low replica counts (1 canary, 9 stable), the actual split can fluctuate between roughly 5% and 15%, depending on which pods are serving long-lived connections. For precise percentage control (exactly 5% to canary, exactly 95% to stable), use weighted routes in an Istio VirtualService instead of replica-count-based splitting.
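
A small simulation shows how random per-connection selection drifts from the replica ratio. This is a deliberately simplified model, assuming each connection independently picks one of 10 pods uniformly at random; it is not a reproduction of kube-proxy itself, and the connection count of 50 is arbitrary.

```python
import random


def simulate_canary_share(stable_pods: int, canary_pods: int,
                          connections: int, rng: random.Random) -> float:
    """Fraction of connections that land on a canary pod in one trial."""
    total = stable_pods + canary_pods
    # Pods 0..canary_pods-1 stand in for the canary pods.
    hits = sum(1 for _ in range(connections)
               if rng.randrange(total) < canary_pods)
    return hits / connections


rng = random.Random(42)  # fixed seed so runs are repeatable
shares = [simulate_canary_share(9, 1, connections=50, rng=rng)
          for _ in range(1000)]
mean = sum(shares) / len(shares)
print(f"min={min(shares):.0%} mean={mean:.1%} max={max(shares):.0%}")
```

The mean stays close to 10%, while individual trials land well above or below it. That spread matters most when only a few long-lived connections are open.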