Overview and What You Will Learn
This guide covers the full set of rollback mechanisms available once a
deployment has caused a problem in production -- from the immediate
kubectl rollout undo emergency button, to ArgoCD's Git-revision-based
rollback, to Argo Rollouts' fully automated rollback on metric failure, to
the genuinely hard problem of rolling back a database migration. You will
also get a decision framework for choosing rollback versus rolling forward,
and a runbook structure for executing a rollback calmly during an incident
rather than improvising one under pressure.
Why This Matters in Production
Rollback speed is often the biggest single lever on mean time to recovery (MTTR), and MTTR largely determines how much customer-facing damage a bad deploy does. A team that can detect a bad deploy and revert it in two minutes has an incident that barely shows up in uptime numbers; a team where rollback means paging someone who then has to remember the right sequence of manual commands can turn the same root cause into a thirty-minute outage. Fintech platforms like Razorpay design specifically around fast, well-rehearsed rollback because the cost of a slow recovery during a payment-processing incident compounds every minute it continues.
**Common Mistake:** Treating rollback as something to figure out during the incident itself. The first time a team discovers their `kubectl rollout undo` does not actually work because of a missing RBAC permission, or that the ArgoCD revision history was pruned past what they needed, should be in a rehearsal -- not during a real outage at 3am.
Core Principles
Rollback decision tree

    +------------------------------------------+
    | Deployment caused a failure: choose      |
    | method                                   |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Feature flag covers the change? Flip     |
    | flag                                     |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | No flag: ArgoCD sync to prior Git        |
    | revision                                 |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | No GitOps: kubectl rollout undo directly |
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | DB migration involved? Apply down        |
    | migration                                |
    +------------------------------------------+

kubectl rollout undo: the emergency button

    kubectl rollout undo deployment/order-gateway -n order-gateway-prod
    kubectl rollout status deployment/order-gateway -n order-gateway-prod

This reverts to the previous ReplicaSet directly in the cluster. It is fast and requires no Git access, but it creates exactly the kind of GitOps drift discussed elsewhere in this hub -- if a GitOps agent with self-heal enabled is also watching this Deployment, it will revert your rollback back to the broken version on its next reconciliation pass.
**Security:** `kubectl rollout undo` requires direct cluster write access. If your standard deploy path is GitOps-only with no human holding that access day-to-day, decide in advance who has emergency kubectl credentials and how that access is granted and audited -- not during the incident itself.
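
One way to confirm emergency access ahead of time is to ask the API server directly with `kubectl auth can-i`, run as the identity that would perform the rollback. The sketch below is a minimal example under assumptions: the helper name `check_rollback_access` is hypothetical, and the permission set it checks (patching Deployments, listing ReplicaSets) is an assumed approximation of what an undo needs in your cluster.

```shell
# Hypothetical helper: report whether the current identity can perform the
# operations a rollout undo is assumed to need in the given namespace.
check_rollback_access() {
  ns="$1"
  missing=0
  # Assumed permission set: patch Deployments, list ReplicaSets.
  for check in "patch deployments" "list replicasets"; do
    # $check is intentionally unquoted so it splits into verb and resource.
    if [ "$(kubectl auth can-i $check -n "$ns" 2>/dev/null)" != "yes" ]; then
      echo "MISSING: $check in $ns"
      missing=1
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "OK: rollback access present in $ns"
  fi
  return "$missing"
}
```

Running `check_rollback_access order-gateway-prod` as part of a scheduled rehearsal surfaces a missing permission before an incident does.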
ArgoCD rollback: sync to a previous Git revision
The GitOps-native rollback path reverts the commit in the config repo and lets the existing automated sync apply it -- consistent with how the deployment was made in the first place, so it does not fight self-healing.

    git -C config-repo revert <bad-commit-sha> --no-edit
    git -C config-repo push
    # or, without waiting for a new commit, sync ArgoCD directly to a prior revision
    argocd app sync order-gateway-production --revision <previous-good-sha>

Note that ArgoCD rejects a sync to a revision other than the tracked one while automated sync is enabled on the application, so the direct-sync route typically means disabling auto-sync first. The revert-and-push route has no such restriction.

**Tip:** Prefer `git revert` over `git reset --hard` when rolling back a config repo -- revert preserves history and creates a new commit that is itself auditable, while a hard reset rewrites history and can conflict with branch protection rules.
Argo Rollouts: automated rollback on metric failure
As covered in the canary deployments topic, an AnalysisTemplate checking error rate or latency against Prometheus can trigger an automatic abort and revert with zero human involvement -- the fastest possible rollback path, because it starts the moment the metric crosses the threshold rather than the moment a human notices.
Database migration rollback: the hard problem
Application code rollback is close to free -- the old container image still exists and can be redeployed instantly. Database schema changes are not symmetric: a migration that drops a column cannot be undone without losing whatever data was written to that column in the meantime.
- Forward-only migrations -- the standard mitigation is to never write a migration that is unsafe to leave half-applied: add new columns as nullable first, deploy code that writes to both old and new columns, backfill, then only drop the old column in a later, separate migration once the new code path has been running safely for a while.
- If a migration genuinely cannot be made backward-compatible, the rollback plan for that specific deploy needs to be written and reviewed before the migration runs, not improvised afterward.
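
The second point depends on someone noticing that a migration is not backward-compatible in the first place. As a rough sketch, a pre-merge check could scan migration files for destructive statements and demand a rollback plan when it finds one. This assumes migrations are plain SQL files; the helper name `needs_rollback_plan` is hypothetical and the pattern list is illustrative, not exhaustive.

```shell
# Hypothetical pre-merge check: flag SQL migrations containing statements
# that cannot be undone without data loss. The pattern list is illustrative.
needs_rollback_plan() {
  file="$1"
  # -i: case-insensitive, -E: extended regex, -q: exit status only.
  if grep -iEq 'drop[[:space:]]+(column|table)|truncate[[:space:]]' "$file"; then
    echo "REVIEW: $file contains a destructive statement; attach a rollback plan"
    return 1
  fi
  echo "OK: $file looks additive"
  return 0
}
```

A text match like this only catches the obvious cases, so it complements human review of the migration rather than replacing it.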
**Remember:** "Roll back the application code" and "roll back the database" are two different operations that do not automatically happen together. Confirm explicitly whether a given deploy's database changes are safe to leave in place if you roll back only the application layer.
Feature flag rollback: the fastest option
If the risky change is gated behind a feature flag, rollback is flipping the flag off -- no deployment, no pipeline run, often propagating to all instances within seconds. This is why high-risk changes are frequently shipped flag-off-by-default, with the flag flipped on gradually after the code itself has already been deployed and is sitting inert.
Rollback vs roll-forward decision framework
- If the previous version is known-good and the fix is not yet ready, rollback.
- If rolling back would also require undoing an unsafe database change, and a forward fix is faster to write and verify than reasoning through the data implications of a rollback, roll forward.
- If the issue is isolated to a feature flag's blast radius, flip the flag rather than touching the deployment at all.
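
The three rules above can be sketched as a small function, which is mainly useful for forcing the inputs to be stated explicitly during an incident. The name `choose_recovery_path` is hypothetical. The evaluation order (flag first) and the `escalate` fallback for cases the rules do not cover are assumptions, not part of the framework itself.

```shell
# Hypothetical helper encoding the three rules. Each argument is "yes" or "no":
#   $1 change is gated by a feature flag
#   $2 previous version is known-good
#   $3 rollback would require undoing an unsafe database change
#   $4 a forward fix is faster to write and verify than the rollback
choose_recovery_path() {
  flagged="$1"; known_good="$2"; unsafe_db="$3"; fix_faster="$4"
  if [ "$flagged" = "yes" ]; then
    echo "flip-flag"
  elif [ "$unsafe_db" = "yes" ] && [ "$fix_faster" = "yes" ]; then
    echo "roll-forward"
  elif [ "$known_good" = "yes" ]; then
    echo "rollback"
  else
    # Assumption: anything the rules do not cover goes to a human decision.
    echo "escalate"
  fi
}
```

For example, `choose_recovery_path no yes no no` prints `rollback`: no flag covers the change, the previous version is known-good, and no database change stands in the way.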
Detailed Step-by-Step Practical Lab
This lab builds and rehearses a rollback runbook for Zerodha's
order-gateway service across all four rollback mechanisms.
Milestone 1 — Confirm kubectl rollout history depth

    kubectl rollout history deployment/order-gateway -n order-gateway-prod

At this point you know how many previous revisions are actually available
to roll back to -- if the history is too shallow, increase
revisionHistoryLimit on the Deployment before you need it in an incident.
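
To make this check repeatable, the configured limit can be read straight from the Deployment spec. The sketch below assumes a hypothetical helper name, `check_history_limit`, and a minimum that you choose yourself; it only reports and does not change anything.

```shell
# Hypothetical helper: warn when a Deployment keeps fewer old ReplicaSets
# than the minimum you want available during an incident.
check_history_limit() {
  deploy="$1"; ns="$2"; min="$3"
  limit=$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.revisionHistoryLimit}')
  # An empty result is treated as the Kubernetes default of 10.
  limit="${limit:-10}"
  if [ "$limit" -lt "$min" ]; then
    echo "LOW: revisionHistoryLimit=$limit, want at least $min"
    return 1
  fi
  echo "OK: revisionHistoryLimit=$limit"
}
```

If the service is GitOps-managed, raise the value in the manifest in the config repo rather than patching the live object, so the change does not itself become drift.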
Milestone 2 — Rehearse kubectl rollout undo in staging

    kubectl -n order-gateway-staging set image deployment/order-gateway \
      order-gateway=...:deliberately-broken-tag
    kubectl rollout status deployment/order-gateway \
      -n order-gateway-staging --timeout=30s || \
      kubectl rollout undo deployment/order-gateway -n order-gateway-staging

At this point the team has timed exactly how long a manual rollback takes end to end, in a low-stakes environment, before relying on it in production.
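
The commands in this milestone perform the rollback but do not measure it. A minimal sketch of a timing wrapper follows; the name `timed_rollback` and the 120-second timeout are assumptions, and the measurement covers only the undo plus the wait for the rollout to complete, not detection time.

```shell
# Hypothetical wrapper: run a rollback and report wall-clock seconds until
# the Deployment reports a completed rollout.
timed_rollback() {
  deploy="$1"; ns="$2"
  start=$(date +%s)
  kubectl rollout undo "deployment/$deploy" -n "$ns" || return 1
  # Block until the reverted ReplicaSet is fully rolled out, or give up.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s || return 1
  end=$(date +%s)
  echo "rollback of $deploy completed in $((end - start))s"
}
```

Recording the figure from each rehearsal gives a baseline to compare against when the procedure or the cluster changes.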
Milestone 3 — Rehearse the ArgoCD Git-revert path

    git -C config-repo revert HEAD --no-edit && git -C config-repo push
    # block until the application reports synced and healthy
    argocd app wait order-gateway-staging --sync --health

At this point you have confirmed the revert commit actually triggers a sync and the cluster converges back to the prior state.
Milestone 4 — Verify Argo Rollouts automated rollback triggers correctly
Re-run the canary failure test from the blue-green and canary topic and confirm the rollback completes without anyone running a manual command.

    kubectl argo rollouts get rollout order-gateway -n order-gateway-staging

At this point you have evidence the automated path works, which is the rollback mechanism that will fire fastest in a real incident.
Milestone 5 — Document the database migration rollback plan template

    Migration: add_kyc_status_column
    Forward: add nullable column, backfill, default value applied async
    Rollback-safe: yes - column unused by old code path, safe to leave
    Rollback action if needed: none required, no app rollback dependency

At this point every migration in the deploy has an explicit rollback-safety note attached, reviewed alongside the migration itself.
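
A template only helps if it is actually filled in. As a sketch, a review-time check could confirm that a plan file carries all four fields from the template above. The helper name `validate_rollback_plan` is hypothetical, as is the assumption that each plan lives in its own text file with one field per line.

```shell
# Hypothetical check: a rollback plan file must carry the four template
# fields, each starting its own line.
validate_rollback_plan() {
  file="$1"
  status=0
  for field in "Migration:" "Forward:" "Rollback-safe:" "Rollback action if needed:"; do
    # -q: exit status only; ^ anchors the field name to the start of a line.
    if ! grep -q "^$field" "$file"; then
      echo "MISSING: $field"
      status=1
    fi
  done
  return "$status"
}
```

The check confirms the fields are present, not that their content is right; judging whether "Rollback-safe: yes" is true remains a reviewer's job.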
Milestone 6 — Write the incident rollback runbook
1. Confirm which deployment/commit is suspected
2. Check: is this gated by a feature flag? Flip it off first
3. If GitOps-managed: git revert config repo commit, push, watch sync
4. If urgent and GitOps sync is too slow: kubectl rollout undo directly, then also revert in Git to avoid self-heal fighting the fix
5. Verify service health before declaring the incident resolved
6. File the post-incident note: what broke, which rollback path, timing

At this point the runbook exists as a reviewed document, not something reconstructed from memory during the next real incident.
Production Best Practices & Common Pitfalls
- Set `revisionHistoryLimit` high enough that an emergency `kubectl rollout undo` always has somewhere useful to revert to.
- Always pair a direct kubectl rollback with a matching Git revert if the service is GitOps-managed, or self-healing will silently undo your fix.
- Review every migration's rollback safety at PR time, not after an incident reveals it was never considered.
- Rehearse rollback in staging on a fixed cadence -- a procedure no one has actually run is a procedure you cannot trust the timing of.
- Default high-risk changes to a feature flag, off by default, so the fastest rollback option exists before you need it, not after.
Quick Reference & Troubleshooting Commands
| Symptom | Command | What to Look For |
|---|---|---|
| `rollout undo` has nowhere to revert to | `kubectl rollout history deployment/order-gateway` | `revisionHistoryLimit` set too low, history already pruned |
| Manual rollback keeps getting reverted | `argocd app get <name> -o json \| jq .spec.syncPolicy` | Self-heal active, fighting a fix made outside Git |
| Rollback succeeded but errors continue | Check whether DB migration also needs addressing | App-layer rollback alone insufficient for this change |
| Argo Rollouts rollback never triggers | `kubectl describe analysisrun <name>` | AnalysisTemplate query misconfigured, never evaluating |