Canary Deployment — Test Your Release With Real Users Safely
What is Canary Deployment in Simple Terms?
A canary deployment lets you test a new version of your application on real production traffic — but only for a small, controlled percentage of users. If the new version has a bug, only 5-10% of users are affected instead of everyone. You catch the problem, scale the canary back to zero, and only a fraction of your users ever noticed.
The name comes from coal miners who carried canary birds into mines to detect dangerous gases. If the canary died, the miners knew to evacuate before the gas reached them. In software, the canary pods detect problems before they affect the full user base.
Hotstar uses canary deployments for streaming API changes during off-peak hours — routing 5% of traffic to the new version while 95% stays on stable. If streaming quality metrics drop for the canary users, the canary is pulled back before it affects anyone else.
How Traffic Splitting Works With Replica Counts
The simplest canary strategy in Kubernetes uses replica ratios. The Service routes traffic roughly evenly across all pods with the matching label — so 1 canary pod out of 10 total means 10% of traffic hits the canary:

    9 stable pods (v1.9.0) + 1 canary pod (v2.0.0) = 10 pods total
    Service routes to all 10 pods (matched by app: notification-api)

    Approximate traffic split:
      90% -> stable pods (9 out of 10)
      10% -> canary pod  (1 out of 10)

    +-------------------------+     +-------------------------+
    | notification-api-stable |     | notification-api-canary |
    | replicas: 9 (v1.9.0)    |     | replicas: 1 (v2.0.0)    |
    | track: stable           |     | track: canary           |
    +-------------------------+     +-------------------------+
                 |                               |
                 +---------------+---------------+
                                 |
                    +-------------------------+
                    | Service                 |
                    | selector:               |
                    |   app: notification-api |
                    | (matches BOTH)          |
                    +-------------------------+
                                 |
                         Internet traffic

Setting Up a Canary Deployment

    # stable-deployment.yaml: current production version
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: notification-api-stable
      namespace: production
    spec:
      replicas: 9
      selector:
        matchLabels:
          app: notification-api
          track: stable
      template:
        metadata:
          labels:
            app: notification-api
            track: stable
            version: v1.9.0
        spec:
          containers:
            - name: notification-api
              image: registry.hotstar.in/notification-api:v1.9.0
              ports:
                - containerPort: 8080
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 10
                periodSeconds: 5
              resources:
                requests:
                  cpu: "200m"
                  memory: "256Mi"
                limits:
                  cpu: "1000m"
                  memory: "512Mi"

    # canary-deployment.yaml: new version under test
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: notification-api-canary
      namespace: production
    spec:
      replicas: 1  # Start with just 1 out of 10 pods = 10% traffic
      selector:
        matchLabels:
          app: notification-api
          track: canary
      template:
        metadata:
          labels:
            app: notification-api
            track: canary
            version: v2.0.0
        spec:
          containers:
            - name: notification-api
              image: registry.hotstar.in/notification-api:v2.0.0
              ports:
                - containerPort: 8080
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 10
                periodSeconds: 5
              resources:
                requests:
                  cpu: "200m"
                  memory: "256Mi"
                limits:
                  cpu: "1000m"
                  memory: "512Mi"

    # service.yaml: routes to BOTH deployments via the shared app label
    apiVersion: v1
    kind: Service
    metadata:
      name: notification-api
      namespace: production
    spec:
      selector:
        app: notification-api  # Matches BOTH stable and canary pods
      ports:
        - name: http
          port: 80
          targetPort: 8080

The Complete Canary Release Workflow

    # Step 1: Deploy canary at 10% (1 canary, 9 stable)
    kubectl apply -f canary-deployment.yaml
    kubectl rollout status deployment/notification-api-canary -n production

    # Step 2: Verify canary pod is running and receiving traffic
    kubectl get pods -n production -l track=canary
    # NAME                                   READY   STATUS
    # notification-api-canary-7d9f8c-xkp2q   1/1     Running

    # Step 3: Monitor canary for 15-30 minutes
    # Compare error rates between stable and canary.
    # Prometheus query:
    #   sum by (track) (
    #     rate(http_requests_total{status=~"5..", namespace="production"}[5m])
    #   ) /
    #   sum by (track) (
    #     rate(http_requests_total{namespace="production"}[5m])
    #   ) * 100

    # Watch canary logs for errors
    kubectl logs -f -l track=canary -n production

    # Check canary pod resource usage
    kubectl top pods -n production -l track=canary

    # Step 4: Gradually increase canary traffic if healthy
    # Increase to 30%: 3 canary, 7 stable
    kubectl scale deployment notification-api-canary --replicas=3 -n production
    kubectl scale deployment notification-api-stable --replicas=7 -n production

    # Wait 15 minutes, monitor again

    # Increase to 50%: 5 canary, 5 stable
    kubectl scale deployment notification-api-canary --replicas=5 -n production
    kubectl scale deployment notification-api-stable --replicas=5 -n production

    # Increase to 100%: 10 canary, 0 stable
    kubectl scale deployment notification-api-canary --replicas=10 -n production
    kubectl scale deployment notification-api-stable --replicas=0 -n production

    # Step 5: Clean up by deleting the old stable Deployment.
    # A Deployment cannot be renamed in place, so the canary Deployment keeps its
    # name and track=canary label until you re-apply the stable manifest with the
    # v2.0.0 image. Once stable is deleted, the Step 6 rollback is no longer available.
    kubectl delete deployment notification-api-stable -n production

    # Step 6: EMERGENCY ROLLBACK (can happen at any stage before Step 5)
    # Remove all canary traffic and restore stable to full capacity
    kubectl scale deployment notification-api-canary --replicas=0 -n production
    kubectl scale deployment notification-api-stable --replicas=10 -n production
    # Canary pods leave the Service endpoints within seconds. If stable had been
    # scaled down, full capacity returns once the new stable pods pass their
    # readiness probe. If stable is at 0 replicas, scale it up and wait for it to
    # become Ready before removing the canary.

What to Monitor During a Canary Release

    # The four key metrics to watch for the canary version
    # (queries 1 and 2 assume the application metrics carry the pod's track label)

    # 1. Error rate (most important)
    # Any increase in 5xx errors on canary pods signals a problem
    sum by (track) (
      rate(http_requests_total{status=~"5..", namespace="production"}[2m])
    )
    /
    sum by (track) (
      rate(http_requests_total{namespace="production"}[2m])
    ) * 100
    # Alert threshold: canary error rate > 2x stable error rate

    # 2. Response latency
    # Canary pods running slower signals a performance regression
    histogram_quantile(0.99,
      sum by (track, le) (
        rate(http_request_duration_seconds_bucket{namespace="production"}[2m])
      )
    )
    # Alert threshold: canary P99 latency > 1.5x stable P99 latency

    # 3. Pod restart count
    # Canary pods crashing signals a startup or runtime error
    increase(kube_pod_container_status_restarts_total{
      namespace="production",
      pod=~"notification-api-canary-.*"
    }[5m]) > 0

    # 4. Memory usage
    # Memory growing faster than on stable suggests a memory leak
    container_memory_working_set_bytes{
      namespace="production",
      pod=~"notification-api-canary-.*"
    }

Canary vs Blue-Green vs Rolling Update
| Strategy | Traffic Control | Rollback Speed | Cost | Best For |
|---|---|---|---|---|
| Rolling Update | Gradual pod replacement | 2-5 minutes | Normal | Standard feature releases |
| Canary | Approximate % by replica ratio (precise with a service mesh) | Instant (scale to 0) | +10% pods | Risky changes, new features |
| Blue-Green | All-or-nothing switch | Instant (service patch) | +100% pods | Payment flows, schema changes |
**Remember:** Always monitor the canary for at least 15 minutes at each traffic percentage before increasing it. Error rate, P99 latency, pod restarts, and memory usage are the signals that matter. A canary with a slow memory leak can look healthy after 5 minutes at 10% and only show the problem after 20 minutes at 30%, which is why each stage needs its full observation window before you continue the rollout.
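One way to make the memory-leak signal concrete is to compare how fast memory grows on the canary versus stable over the observation window. The sketch below is illustrative only: the function names, the sample values, and both thresholds (`min_growth`, `ratio`) are assumptions, not part of the workflow above. It fits a least-squares slope to evenly spaced memory samples, such as values read from `container_memory_working_set_bytes` and converted to MiB.

```python
def slope_per_minute(samples_mib, interval_minutes=1.0):
    """Least-squares slope (MiB per minute) of samples taken at a fixed interval."""
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [i * interval_minutes for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples_mib) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_mib))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def looks_like_leak(canary_mib, stable_mib, min_growth=0.5, ratio=1.5):
    """Flag the canary when its memory grows clearly faster than stable.

    min_growth (MiB/min) and ratio are hypothetical thresholds for illustration.
    """
    canary_slope = slope_per_minute(canary_mib)
    stable_slope = slope_per_minute(stable_mib)
    if canary_slope < min_growth:
        return False  # growth too small to call a leak
    return canary_slope > ratio * max(stable_slope, 0.0)


# Hypothetical one-minute samples in MiB
canary = [260, 262, 264, 266, 268]
stable = [255, 255, 256, 255, 256]
print(slope_per_minute(canary), looks_like_leak(canary, stable))
```

A short window gives a noisy slope, so in practice the samples would span the full observation period at each stage rather than the five points used here.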
**Tip:** Automate the canary promotion with a script that checks error rates for you. If the canary error rate stays below 1% for 15 minutes at the current traffic percentage, the script increases traffic to the next level. If the error rate exceeds 2%, it scales the canary to zero and alerts the team. This removes the human decision point from the critical path and makes canary releases safe enough to run during business hours.
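The decision logic in that tip could be sketched roughly as follows. This is a minimal outline under stated assumptions, not a finished controller: it assumes something else collects one canary error-rate sample per minute (for example from the Prometheus query shown earlier), the function names are hypothetical, and the `kubectl` commands are only built as strings, not executed. The 1% and 2% thresholds, the 15-minute window, and the 1/3/5/10 replica steps come from the text above.

```python
PROMOTE_BELOW = 1.0    # percent: promote if every sample in the window is below this
ROLLBACK_ABOVE = 2.0   # percent: roll back if any sample exceeds this
OBSERVE_MINUTES = 15   # minimum observation window per stage
STEPS = [1, 3, 5, 10]  # canary replicas out of 10 total pods (10%, 30%, 50%, 100%)


def decide(error_rates_pct):
    """Return 'rollback', 'promote', or 'hold'.

    error_rates_pct: canary error rate in percent, one sample per minute, oldest first.
    """
    if any(rate > ROLLBACK_ABOVE for rate in error_rates_pct):
        return "rollback"
    window = error_rates_pct[-OBSERVE_MINUTES:]
    if len(window) >= OBSERVE_MINUTES and all(r < PROMOTE_BELOW for r in window):
        return "promote"
    return "hold"


def next_replicas(current_canary, decision, total=10):
    """Map a decision to (canary_replicas, stable_replicas)."""
    if decision == "rollback":
        return 0, total
    if decision == "promote":
        later = [s for s in STEPS if s > current_canary]
        canary = later[0] if later else current_canary
        return canary, total - canary
    return current_canary, total - current_canary


def scale_commands(canary, stable, namespace="production"):
    """Build the kubectl commands for a stage; this sketch does not run them."""
    return [
        f"kubectl scale deployment notification-api-canary --replicas={canary} -n {namespace}",
        f"kubectl scale deployment notification-api-stable --replicas={stable} -n {namespace}",
    ]


samples = [0.2] * 15  # hypothetical: 15 quiet minutes at the 10% stage
canary, stable = next_replicas(1, decide(samples))
for cmd in scale_commands(canary, stable):
    print(cmd)
```

Keeping the decision separate from the scaling commands makes the thresholds easy to test without a cluster. Alerting and the actual command execution are left out on purpose.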
**Common Mistake:** Using a canary with only 1 pod at very low traffic. A single canary pod at 5% traffic may not receive enough requests to produce statistically meaningful error rates. If your service handles 10 requests per second in total, the canary pod gets 0.5 requests per second, so at a 1% error rate you would see only a handful of failed requests in a 15-minute window. Start the canary at a minimum of 10%, and before trusting the error rate, check that the canary has actually served enough requests in the observation window (hundreds, not tens).
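The arithmetic behind this can be checked with a few lines. The sketch below uses the 10 requests per second and 5% figures from the example above; the function names and the 1,000-request target are hypothetical values chosen for illustration, not a recommended threshold.

```python
def canary_requests(total_rps, canary_share, window_minutes):
    """Requests the canary is expected to serve during the observation window."""
    return total_rps * canary_share * window_minutes * 60


def expected_failures(requests, error_rate):
    """Failed requests you would expect to observe at a given true error rate."""
    return requests * error_rate


def minutes_needed(total_rps, canary_share, min_requests):
    """Minutes of observation before the canary has served min_requests."""
    return min_requests / (total_rps * canary_share * 60)


served = canary_requests(total_rps=10, canary_share=0.05, window_minutes=15)
print(served)                                     # requests served in 15 minutes
print(expected_failures(served, 0.01))            # failures expected at a 1% error rate
print(minutes_needed(10, 0.05, min_requests=1000))
```

With so few expected failures, one or two unlucky requests can double the measured error rate, which is why a longer window or a larger canary share is needed at low traffic.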
**Warning:** In Kubernetes, splitting canary traffic by replica count is approximate, not precise. In its default iptables mode, kube-proxy picks a backend pod at random for each new connection, so the split matches the replica ratio only on average. Long-lived connections (HTTP keep-alive, gRPC streams) can pin a disproportionate share of requests to whichever pods they happened to land on. At low replica counts (1 canary, 9 stable) the actual canary share can drift noticeably from 10%. For precise percentage control (for example 5% to canary and 95% to stable, independent of pod counts), use a service mesh such as Istio, whose VirtualService supports weighted routing, instead of replica-count-based splitting.
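Two small calculations illustrate why replica-based splitting is coarse. The sketch below is a simplified model, not a description of kube-proxy internals: it assumes each connection independently picks a pod uniformly at random and carries a similar request volume, and the function names are hypothetical. The first function finds the smallest pod counts that hit a target share exactly. The second estimates how much the observed canary share varies for a given number of connections, using the binomial standard deviation.

```python
import math


def replicas_for_share(target_share, max_total=100):
    """Smallest (canary, total) pod counts whose ratio equals target_share."""
    for total in range(1, max_total + 1):
        canary = round(target_share * total)
        if canary >= 1 and math.isclose(canary / total, target_share):
            return canary, total
    return None  # no exact ratio within max_total pods


def share_std_dev(canary_share, connections):
    """Standard deviation of the observed canary share when each of
    `connections` independent connections picks a pod uniformly at random."""
    return math.sqrt(canary_share * (1 - canary_share) / connections)


print(replicas_for_share(0.10))   # pod counts needed for a 10% canary
print(replicas_for_share(0.05))   # pod counts needed for a 5% canary
print(share_std_dev(0.10, 100))   # spread of the split with 100 connections
```

Under this model a 5% canary needs 20 pods in total, and fewer, longer-lived connections widen the spread around the nominal share.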