Rolling Update — The Default Zero-Downtime Release Strategy
What is a Rolling Update in Simple Terms?
A rolling update replaces your old version with a new version a few pods at a time (one at a time in the simplest case), while always keeping enough pods running to serve traffic. Users see no downtime because healthy pods remain available to serve requests throughout the update.
Traditionally, deploying a new version often meant stopping the old version and then starting the new one, which caused downtime. A rolling update avoids this by overlapping the old and new versions. While a new pod starts up and passes its health checks, the old pods keep serving traffic. Only once the new pod is confirmed healthy is an old pod terminated.
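The overlap described above can be sketched as a small simulation. This is a simplified model, not the Deployment controller's actual logic: it assumes `maxUnavailable: 0` and that every new batch becomes Ready before any old pod is terminated. The function name `simulate_rollout` is made up for illustration.

```shell
#!/usr/bin/env bash
# Simplified model of a rolling update with maxUnavailable=0.
# Usage: simulate_rollout REPLICAS MAX_SURGE
simulate_rollout() {
  local replicas=$1 max_surge=$2
  local old=$replicas new=0 batch

  # A surge of 0 together with maxUnavailable=0 could never make progress.
  if [ "$max_surge" -lt 1 ]; then
    echo "max_surge must be at least 1" >&2
    return 1
  fi

  echo "start: old=$old new=$new total=$((old + new))"
  while [ "$old" -gt 0 ]; do
    # Surge: create up to max_surge new pods, but no more than are still needed.
    batch=$max_surge
    if [ "$batch" -gt "$old" ]; then batch=$old; fi
    new=$((new + batch))
    echo "surge: old=$old new=$new total=$((old + new))"

    # Once the new pods are Ready, the same number of old pods is terminated.
    old=$((old - batch))
    echo "scale-down: old=$old new=$new total=$((old + new))"
  done
}

simulate_rollout 3 1
```

For 3 replicas and a surge of 1, the total never drops below 3 and never exceeds 4, and the run ends with 3 new pods and no old ones.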
```text
Before update: 3 pods on v1
+----------+ +----------+ +----------+
| Pod 1    | | Pod 2    | | Pod 3    |
| v1.0     | | v1.0     | | v1.0     |
| Serving  | | Serving  | | Serving  |
+----------+ +----------+ +----------+

Step 1: maxSurge=1 creates a new pod
+----------+ +----------+ +----------+ +----------+
| Pod 1    | | Pod 2    | | Pod 3    | | Pod 4    |
| v1.0     | | v1.0     | | v1.0     | | v2.0     |
| Serving  | | Serving  | | Serving  | | Starting |
+----------+ +----------+ +----------+ +----------+

Step 2: Pod 4 passes readinessProbe, Pod 1 terminated
+----------+ +----------+ +----------+
| Pod 2    | | Pod 3    | | Pod 4    |
| v1.0     | | v1.0     | | v2.0     |
| Serving  | | Serving  | | Serving  |
+----------+ +----------+ +----------+

Step 3: New pod created for Pod 2, Pod 2 terminated
+----------+ +----------+ +----------+
| Pod 3    | | Pod 4    | | Pod 5    |
| v1.0     | | v2.0     | | v2.0     |
| Serving  | | Serving  | | Serving  |
+----------+ +----------+ +----------+

Step 4: New pod created for Pod 3, Pod 3 terminated
+----------+ +----------+ +----------+
| Pod 4    | | Pod 5    | | Pod 6    |
| v2.0     | | v2.0     | | v2.0     |
| Serving  | | Serving  | | Serving  |
+----------+ +----------+ +----------+

Update complete: 3 pods on v2, zero downtime throughout.
```

Configuring Rolling Update Strategy
The rolling update configuration lives inside the Deployment spec:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production
spec:
  replicas: 6
  selector:                     # Required in apps/v1; must match the template labels
    matchLabels:
      app: payment-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2               # Create up to 2 extra pods above desired count
                                # During update: up to 8 pods exist (6 desired + 2 surge)
                                # Speeds up the update (2 pods replaced at once)
                                # Can be an absolute number or a percentage: "33%"
      maxUnavailable: 0         # Never reduce below desired count
                                # Guarantees 6 pods always serving traffic
                                # Raise above 0 only if the service has spare capacity
                                # Can be an absolute number or a percentage
  template:
    metadata:
      labels:
        app: payment-api        # Matched by the selector and by `-l app=payment-api`
    spec:
      containers:
        - name: payment-api
          image: registry.razorpay.in/payment-api:v3.1.0
          readinessProbe:       # CRITICAL: controls when new pods receive traffic
            httpGet:            # Rolling update WAITS for this to pass
              path: /health     # before proceeding to the next pod
              port: 8080
            initialDelaySeconds: 15   # Wait 15s before first health check
            periodSeconds: 5          # Check every 5 seconds
            successThreshold: 1       # One success = ready
            failureThreshold: 3       # Three failures = remove from Service
          livenessProbe:        # Controls when Kubernetes restarts a container
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 3
```

Understanding maxSurge and maxUnavailable
```text
replicas: 10 | maxSurge: 2 | maxUnavailable: 0

During update:
  Minimum pods: 10 - 0 = 10  (maxUnavailable=0 means never go below 10)
  Maximum pods: 10 + 2 = 12  (maxSurge=2 allows 2 extra)

  Update pace: replaces 2 pods at a time (maxSurge=2)
  Total time:  5 batches of 2 = faster rollout
  Safety:      never below 10 pods serving traffic


replicas: 10 | maxSurge: 1 | maxUnavailable: 1

During update:
  Minimum pods: 10 - 1 = 9   (can drop to 9 pods serving traffic)
  Maximum pods: 10 + 1 = 11

  This means 10% less capacity during the update
  Acceptable only if you have enough headroom in your remaining pods
```

Triggering and Monitoring Rolling Updates
```shell
# Trigger a rolling update by changing the image tag
kubectl set image deployment/payment-api \
  payment-api=registry.razorpay.in/payment-api:v3.2.0 \
  -n production

# OR update the deployment YAML and apply it
# kubectl apply -f deployment.yaml

# Watch the rollout in real time
kubectl rollout status deployment/payment-api -n production
# Example output:
# Waiting for deployment "payment-api" rollout to finish: 2 out of 6 new replicas have been updated...
# Waiting for deployment "payment-api" rollout to finish: 4 out of 6 new replicas have been updated...
# Waiting for deployment "payment-api" rollout to finish: 2 old replicas are pending termination...
# deployment "payment-api" successfully rolled out

# Watch individual pods being replaced
kubectl get pods -n production -l app=payment-api --watch
# NAME                       READY   STATUS
# payment-api-7d9f8c-xkp2q   1/1     Running       <- old pod
# payment-api-7d9f8c-ab1cd   1/1     Running       <- old pod
# payment-api-6b8d4a-mnp7r   0/1     Pending       <- new pod starting
# payment-api-6b8d4a-mnp7r   1/1     Running       <- new pod ready
# payment-api-7d9f8c-xkp2q   1/1     Terminating   <- old pod dying

# Check the revision history of all rollouts
kubectl rollout history deployment/payment-api -n production
# REVISION  CHANGE-CAUSE
# 1         Initial deployment v3.0.0
# 2         Updated to v3.1.0
# 3         Updated to v3.2.0

# Roll back to the previous revision (the rollback is itself a rolling update)
kubectl rollout undo deployment/payment-api -n production

# Roll back to a specific revision
kubectl rollout undo deployment/payment-api \
  --to-revision=2 \
  -n production
```

Pausing a Rolling Update Mid-Way for Canary Testing
You can pause a rolling update after a few pods have been replaced to test the new version before completing the rollout:
```shell
# Start the update
kubectl set image deployment/payment-api \
  payment-api=registry.razorpay.in/payment-api:v3.2.0 \
  -n production

# Pause it right away; how many pods have been replaced by then depends on timing
kubectl rollout pause deployment/payment-api -n production

# Now typically 1-2 pods are on v3.2.0, the rest are still on v3.1.0
# Monitor the new pods: check logs, check Grafana error rates
kubectl get pods -n production -l app=payment-api
# Some pods show the new image, some show the old

# If the new pods look good, resume the rollout
kubectl rollout resume deployment/payment-api -n production

# If the new pods have problems, roll back.
# kubectl refuses to roll back a paused deployment, so resume first and undo immediately.
kubectl rollout resume deployment/payment-api -n production
kubectl rollout undo deployment/payment-api -n production
```

Force Restart All Pods Without Changing the Image
```shell
# Useful when a ConfigMap or Secret was updated
# Restarts all pods with rolling update strategy (zero downtime)
kubectl rollout restart deployment/payment-api -n production

# Watch the restart happen
kubectl rollout status deployment/payment-api -n production
```

Troubleshooting a Stuck Rolling Update
```shell
# Rolling update stuck: some pods are Pending, or new pods are not becoming Ready

# Step 1: Check why new pods are not becoming Ready
kubectl get pods -n production -l app=payment-api
# Look for pods showing 0/1 Ready for more than 2 minutes

# Step 2: Describe the stuck pod for events
kubectl describe pod payment-api-6b8d4a-mnp7r -n production
# Events section will show, for example:
#   Readiness probe failed: HTTP probe failed with statuscode: 500
#   <- The readinessProbe is failing, so the new version likely has a bug

# Step 3: Check the new pod logs
kubectl logs payment-api-6b8d4a-mnp7r -n production
# Look for startup errors or missing configuration

# Step 4: Roll back immediately to stop the broken version spreading
kubectl rollout undo deployment/payment-api -n production

# Step 5: Verify the rollback completed
kubectl rollout status deployment/payment-api -n production
```

**Remember:** The `readinessProbe` is the most important safety mechanism in a rolling update. With `maxUnavailable: 0`, Kubernetes does not terminate the next old pod until a new pod passes its readiness probe. Without a readiness probe, Kubernetes treats the container as ready the moment it starts, which may be 30 seconds before your application has finished connecting to the database and warming up its cache. Always define a readiness probe that truly reflects whether the application can handle requests.
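Because a failing readiness probe stalls the rollout rather than failing it outright, steps 4 and 5 of the troubleshooting sequence can be wrapped in a guard. The sketch below is one possible approach, not a prescribed workflow: the function name `rollout_or_undo` and the 120s default are illustrative assumptions, while `kubectl rollout status --timeout` and `kubectl rollout undo` are the standard commands.

```shell
#!/usr/bin/env bash
# Usage: rollout_or_undo DEPLOYMENT NAMESPACE [TIMEOUT]
# Waits for the rollout to finish; if it does not finish within TIMEOUT,
# rolls back to the previous revision and returns non-zero.
rollout_or_undo() {
  local deployment=$1 namespace=$2 timeout=${3:-120s}

  if kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    echo "rollout of $deployment succeeded"
    return 0
  fi

  echo "rollout of $deployment did not complete within $timeout; rolling back" >&2
  kubectl rollout undo "deployment/$deployment" -n "$namespace"
  # Wait for the rollback itself to finish before reporting the failure.
  kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"
  return 1
}

# Example (requires cluster access):
# rollout_or_undo payment-api production 180s
```

The timeout should comfortably exceed the time a healthy pod needs to pass its readiness probe, otherwise a slow but healthy rollout gets rolled back.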
**Tip:** Annotate your deployments with the change reason using `kubernetes.io/change-cause` as part of each rollout: set it in the manifest you apply, or annotate right after `kubectl set image` so the message attaches to the new revision. This makes `kubectl rollout history` show meaningful messages like "v3.2.0: added UPI retry logic" instead of blank lines. During an incident, when you need to roll back to a specific revision, knowing which revision number corresponds to which release is critical.
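A minimal sketch of recording the change cause from the command line follows. The helper names `change_cause` and `record_release` and the "version: summary" message format are assumptions for illustration; `kubectl annotate ... --overwrite` is the standard command.

```shell
#!/usr/bin/env bash
# Usage: change_cause VERSION SUMMARY
# Builds the key=value string for the change-cause annotation.
change_cause() {
  printf 'kubernetes.io/change-cause=%s: %s\n' "$1" "$2"
}

# Usage: record_release DEPLOYMENT NAMESPACE VERSION SUMMARY
# Writes the annotation onto the Deployment, replacing any previous value.
record_release() {
  kubectl annotate "deployment/$1" -n "$2" --overwrite "$(change_cause "$3" "$4")"
}

# Example (requires cluster access), run right after `kubectl set image`:
# record_release payment-api production v3.2.0 "added UPI retry logic"
```

Keeping the annotation in the Deployment manifest instead achieves the same result and leaves the reason visible in version control.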
**Common Mistake:** Leaving `maxUnavailable: 25%` (the default) or any other non-zero value on production services that cannot tolerate reduced capacity. With 10 replicas and `maxUnavailable: 25%`, Kubernetes rounds the percentage down to 2 pods and may terminate 2 old pods before any new pods are ready, briefly leaving you with 80% of normal capacity. During a traffic spike this can cause request failures. Set `maxUnavailable: 0` (with `maxSurge` of at least 1) for any production service with strict availability requirements.
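The capacity arithmetic above can be checked with a short script. It is a sketch of the documented rounding rules (percentages round up for `maxSurge` and down for `maxUnavailable`); the function names `resolve_bound` and `pod_bounds` are made up for illustration.

```shell
#!/usr/bin/env bash
# Usage: resolve_bound VALUE REPLICAS MODE
# VALUE is an absolute number ("2") or a percentage ("25%").
# MODE is "up" (maxSurge rounds up) or "down" (maxUnavailable rounds down).
resolve_bound() {
  local value=$1 replicas=$2 mode=$3 pct
  case $value in
    *%)
      pct=${value%\%}
      if [ "$mode" = "up" ]; then
        echo $(( (pct * replicas + 99) / 100 ))   # integer ceiling
      else
        echo $(( pct * replicas / 100 ))          # integer floor
      fi
      ;;
    *)
      echo "$value"
      ;;
  esac
}

# Usage: pod_bounds REPLICAS MAX_SURGE MAX_UNAVAILABLE
# Prints the lowest number of available pods and the highest total pod count.
pod_bounds() {
  local replicas=$1 surge unavailable
  surge=$(resolve_bound "$2" "$replicas" up)
  unavailable=$(resolve_bound "$3" "$replicas" down)
  echo "min_available=$((replicas - unavailable)) max_total=$((replicas + surge))"
}

pod_bounds 10 2 0         # min_available=10 max_total=12
pod_bounds 10 1 1         # min_available=9 max_total=11
pod_bounds 10 '25%' '25%' # min_available=8 max_total=13
```

The last call shows the default 25%/25% setting on 10 replicas: availability can dip to 8 pods while the total can briefly reach 13.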
**Warning:** Rolling updates are only safe when the new application version is backward compatible with the running version: both read from the same database schema, accept the same request formats, and publish to the same message queue topics. During the update, v1 and v2 pods run simultaneously and both handle requests. If v2 changes the database schema in a way that v1 cannot read, the rolling update can cause request errors or data corruption. Always use backward-compatible changes, or coordinate database migrations separately from application deployments.