Blue-Green Deployment — Zero-Risk Production Releases
What is Blue-Green Deployment in Simple Terms?
Blue-green deployment solves one of the hardest problems in production engineering: how do you validate the new version of your application in the real production environment without exposing your users to it?
The answer: run both versions simultaneously. Blue is your current live version. Green is the new version. You validate green completely before switching any real traffic to it. When you switch, it is instant and total — 100% of traffic moves from blue to green in one command. If anything goes wrong, one more command switches everything back to blue. Your users experience zero downtime in either direction.
Razorpay uses blue-green for payment API deployments during off-peak hours. A bug in a new payment flow is caught in the green environment before any user transaction is affected.
The Architecture
BEFORE SWITCH (all users on blue):

+------------------+    +------------------+
| BLUE (v1 - LIVE) |    | GREEN (v2 - IDLE)|
|                  |    |                  |
| 4 pods running   |    | 4 pods running   |
| slot: blue       |    | slot: green      |
| Serving traffic  |    | NOT serving      |
+------------------+    +------------------+
         ^
         |
+------------------------------------------+
|            Kubernetes Service            |
|           selector: slot=blue            |  <- All traffic goes here
+------------------------------------------+
                     ^
                     |
              Internet users

AFTER SWITCH (all users on green):

+------------------+    +------------------+
| BLUE (v1 - IDLE) |    | GREEN (v2 - LIVE)|
|                  |    |                  |
| 4 pods (scale=0) |    | 4 pods running   |
| slot: blue       |    | slot: green      |
| NOT serving      |    | Serving traffic  |
+------------------+    +------------------+
                                 ^
                                 |
+------------------------------------------+
|            Kubernetes Service            |
|           selector: slot=green           |  <- All traffic goes here now
+------------------------------------------+
                     ^
                     |
              Internet users

Setting Up Blue-Green Deployments
Step 1: Create Both Deployments
# blue-deployment.yaml: current live version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api-blue
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-api
      slot: blue
  template:
    metadata:
      labels:
        app: payment-api
        slot: blue
        version: v3.0.0
    spec:
      containers:
        - name: payment-api
          image: registry.razorpay.in/payment-api:v3.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"

# green-deployment.yaml: new version being validated
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api-green
  namespace: production
spec:
  replicas: 4
  selector:
    matchLabels:
      app: payment-api
      slot: green
  template:
    metadata:
      labels:
        app: payment-api
        slot: green
        version: v3.1.0
    spec:
      containers:
        - name: payment-api
          image: registry.razorpay.in/payment-api:v3.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"

Step 2: Create the Service Pointing to Blue
# service.yaml: the single entry point, controls which slot is live
apiVersion: v1
kind: Service
metadata:
  name: payment-api
  namespace: production
spec:
  selector:
    app: payment-api
    slot: blue   # Currently routing all traffic to blue pods
  ports:
    - name: http
      port: 80
      targetPort: 8080

Step 3: The Complete Release Workflow
# Step 1: Deploy the new green version
kubectl apply -f green-deployment.yaml

# Step 2: Wait for all green pods to be ready
kubectl rollout status deployment/payment-api-green -n production
# deployment "payment-api-green" successfully rolled out

# Step 3: Verify green is healthy before switching
# Test via port-forward; users are NOT hitting green yet
kubectl port-forward deployment/payment-api-green 8081:8080 -n production &
PF_PID=$!
sleep 2   # give the port-forward a moment to start listening
curl http://localhost:8081/health
# {"status": "healthy", "version": "v3.1.0"}

# Run integration tests against green directly
curl http://localhost:8081/api/v1/payment/test
# Verify all critical flows work in the new version
kill "$PF_PID"   # stop the background port-forward

# Step 4: Switch ALL traffic from blue to green (instant)
kubectl patch service payment-api -n production \
  -p '{"spec":{"selector":{"slot":"green"}}}'

# Step 5: Verify the switch happened
kubectl describe service payment-api -n production | grep slot
# Selector: app=payment-api,slot=green

# Test with real traffic hitting green
curl https://api.razorpay.in/health
# {"status": "healthy", "version": "v3.1.0"}

# Step 6: Watch error rates in Grafana for 15-30 minutes
# If everything looks good, proceed to cleanup

# Step 7: Scale down blue (keep it for 24h, don't delete)
kubectl scale deployment payment-api-blue --replicas=0 -n production
# Keeps blue available for fast rollback without consuming resources

# EMERGENCY ROLLBACK (can be done anytime before blue is deleted):
kubectl scale deployment payment-api-blue --replicas=4 -n production
# Wait for blue pods to pass readiness before sending traffic back
kubectl rollout status deployment/payment-api-blue -n production
kubectl patch service payment-api -n production \
  -p '{"spec":{"selector":{"slot":"blue"}}}'
# All traffic is back on blue in under 30 seconds

Blue-Green vs Rolling Update: When to Use Which
+------------------------------------------+
| Use Rolling Update when:                 |
|                                          |
| * Standard feature releases              |
| * Minor bug fixes                        |
| * Quick deploys during off-peak          |
| * Database schema is backward compatible |
|                                          |
| Rollback time: 2-5 minutes               |
| Traffic impact: zero (gradual swap)      |
+------------------------------------------+

+------------------------------------------+
| Use Blue-Green when:                     |
|                                          |
| * Database schema changes                |
| * Payment flow updates                   |
| * Major API version changes              |
| * Releases requiring extensive testing   |
| * You need instant rollback capability   |
|                                          |
| Rollback time: under 30 seconds          |
| Traffic impact: zero (atomic switch)     |
| Cost: 2x pod count during release window |
+------------------------------------------+

Monitoring the Switch
# Watch request rates on both deployments during and after switch
# In one terminal, monitor blue:
kubectl logs -f -l slot=blue -n production | grep -E "GET|POST|ERROR"

# In another terminal, monitor green:
kubectl logs -f -l slot=green -n production | grep -E "GET|POST|ERROR"

# Prometheus query to compare error rates per slot
# In Grafana, run:
# sum by (slot) (rate(http_requests_total{status=~"5..", namespace="production"}[2m]))
# This shows error rate per blue/green slot in real time

**Tip:** Keep the blue deployment scaled to zero (not deleted) for at least 24 hours after switching to green. If a critical bug appears the next morning during business hours, you can scale blue back to 4 replicas and switch the Service back in under 30 seconds, without waiting for a new deployment to roll out. Deleting blue immediately removes this safety net.
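The rollback described in the tip can be wrapped in a small shell function so the three commands always run in the same order. This is a minimal sketch, not a definitive implementation: the function name `rollback_to` is hypothetical, the defaults (`production`, `payment-api`, 4 replicas) simply mirror the examples in this article, and the 120-second timeout is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Sketch: roll back to a slot that was kept at zero replicas.
# rollback_to is a hypothetical helper; defaults mirror this article's examples.
NAMESPACE="${NAMESPACE:-production}"
SERVICE="${SERVICE:-payment-api}"
REPLICAS="${REPLICAS:-4}"

rollback_to() {
  local slot="$1"                      # "blue" or "green"
  local deploy="${SERVICE}-${slot}"

  # 1. Bring the old pods back.
  kubectl scale deployment "$deploy" --replicas="$REPLICAS" -n "$NAMESPACE" || return 1

  # 2. Block until the rollout completes; give up after the assumed timeout.
  kubectl rollout status "deployment/$deploy" -n "$NAMESPACE" --timeout=120s || return 1

  # 3. Only now point the Service back at the old slot.
  kubectl patch service "$SERVICE" -n "$NAMESPACE" \
    -p "{\"spec\":{\"selector\":{\"slot\":\"${slot}\"}}}"
}

# Usage: rollback_to blue
```

Because each step returns early on failure, the Service selector is left untouched if the old pods never become ready.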
**Remember:** Blue-green requires 2x the pod count during the release window. If your blue deployment normally runs 10 pods, you need capacity for 20 pods during the switch. Plan node capacity accordingly; this is where Cluster Autoscaler helps. At Razorpay, blue-green releases are scheduled during off-peak hours specifically to ensure enough node capacity exists without disrupting other services.
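To make the 2x figure concrete, the arithmetic can be sketched in a few lines of shell. The per-pod requests (250m CPU, 256Mi memory) are taken from the Deployment manifests in this article; the function name `peak_requests` is hypothetical, and the result covers only this one application's requests, not whatever else runs on the nodes.

```shell
#!/usr/bin/env bash
# Sketch: peak resource requests while blue and green run side by side.
# Arguments: replicas per slot, CPU request per pod (millicores), memory request per pod (Mi).
peak_requests() {
  local replicas="$1" cpu_m="$2" mem_mi="$3"
  local pods=$(( replicas * 2 ))       # blue + green
  echo "pods=${pods} cpu=$(( pods * cpu_m ))m memory=$(( pods * mem_mi ))Mi"
}

peak_requests 4 250 256     # prints: pods=8 cpu=2000m memory=2048Mi
peak_requests 10 250 256    # prints: pods=20 cpu=5000m memory=5120Mi
```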
**Common Mistake:** Switching the Service before green pods are fully ready. The `readinessProbe` on green pods must be passing on all replicas before you run `kubectl patch service`. A Service only routes to ready pods, so if you switch while green pods are still starting up, it has few or no endpoints to send traffic to: requests fail outright, or all of them land on the handful of pods that are ready, and users see errors or timeouts. Always run `kubectl rollout status deployment/payment-api-green -n production` and confirm `successfully rolled out` before switching the Service.
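One way to guard against this mistake is to make the switch conditional on the Deployment's own status. The sketch below compares `.spec.replicas` with `.status.readyReplicas` and patches the Service only when they match. The function name `safe_switch_to_green` is hypothetical, and the resource names are the ones used in this article.

```shell
#!/usr/bin/env bash
# Sketch: refuse to switch the Service unless every green replica is ready.
# safe_switch_to_green is a hypothetical helper name.
safe_switch_to_green() {
  local ns="production" deploy="payment-api-green" svc="payment-api"
  local desired ready

  desired=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.spec.replicas}')
  ready=$(kubectl get deployment "$deploy" -n "$ns" -o jsonpath='{.status.readyReplicas}')
  ready="${ready:-0}"   # the field is omitted while no replica is ready

  if [ -z "$desired" ] || [ "$desired" -eq 0 ] || [ "$ready" -ne "$desired" ]; then
    echo "green not ready: ${ready}/${desired:-?} replicas; Service left unchanged" >&2
    return 1
  fi

  kubectl patch service "$svc" -n "$ns" -p '{"spec":{"selector":{"slot":"green"}}}'
}

# Usage: safe_switch_to_green || echo "switch aborted"
```

The check also rejects a green Deployment that is missing or scaled to zero, since in both cases there would be nothing to receive traffic.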
**Warning:** Blue-green deployments double your pod count during the release window. If your namespaces have ResourceQuotas, verify the quota allows twice the normal pod count (and twice the CPU and memory requests, if those are capped) before starting the deployment. A quota rejection during the green rollout means some or all green pods are never created, leaving you with an incomplete green environment and a failed deployment to clean up manually during what may be a time-sensitive release window.
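The quota check can be done before `kubectl apply` with a little arithmetic. In the sketch below, `quota_has_room` is a hypothetical helper that only compares numbers. The commented commands show one way to read those numbers from the cluster, assuming a ResourceQuota named `production-quota` (an invented name) that limits `pods` and reports plain integers.

```shell
#!/usr/bin/env bash
# Sketch: check pod-count headroom against a ResourceQuota before deploying green.
# quota_has_room is a hypothetical helper; it only does the arithmetic.
# Arguments: hard limit, pods currently used, extra pods green will add.
quota_has_room() {
  local hard="$1" used="$2" extra="$3"
  [ $(( used + extra )) -le "$hard" ]
}

# Reading the numbers from the cluster (quota name "production-quota" is an assumption):
#   hard=$(kubectl get resourcequota production-quota -n production -o jsonpath='{.status.hard.pods}')
#   used=$(kubectl get resourcequota production-quota -n production -o jsonpath='{.status.used.pods}')
#   quota_has_room "$hard" "$used" 4 || echo "quota too small for green" >&2

quota_has_room 10 4 4 && echo "room for green"
quota_has_room 6 4 4  || echo "quota too small for green"
```

The same comparison applies to `requests.cpu` and `requests.memory` if the quota caps them, although those values carry units and need converting before they can be compared.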