MTTR (Mean Time to Restore)
MTTR is one of the four DORA metrics. It measures the average time it takes to restore service after a production incident, most commonly one caused by a deployment. It is a stability metric: a low MTTR means that when something does break, the organisation recovers quickly. That matters because some failure rate above zero is unavoidable at scale. The real differentiator between elite and low-performing teams is how fast they recover, not whether failures happen at all.
The Four Components of MTTR

    +------------------------------------------+
    | Incident starts: deploy breaks service   | <- t0
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Detection: alert fires or report comes   | <- t1
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Diagnosis: on-call identifies the cause  | <- t2
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Rollback or fix applied                  | <- t3
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Verification: service confirmed healthy  | <- t4 = restored
    +------------------------------------------+

- Detection: the time between the incident actually starting and someone (or some alert) noticing it.
- Diagnosis: the time between noticing something is wrong and identifying which deployment or change actually caused it.
- Rollback or fix: the time it takes to execute the recovery, whether that is `kubectl rollout undo`, an ArgoCD sync to a previous revision, or a feature flag flip.
- Verification: the time it takes to confirm the service is actually healthy again, not just that the rollback command exited successfully.
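
The four phases can be computed from the timestamps t0 to t4 in the diagram. The following is a minimal sketch, assuming each incident record carries all five timestamps; the function name `phase_durations` and the example times are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Phase names follow the list above; each phase spans two consecutive marks.
PHASES = ["detection", "diagnosis", "rollback_or_fix", "verification"]


def phase_durations(t0, t1, t2, t3, t4):
    """Return the duration of each phase plus the total time to restore."""
    marks = [t0, t1, t2, t3, t4]
    if any(later < earlier for earlier, later in zip(marks, marks[1:])):
        raise ValueError("timestamps must be in chronological order")
    durations = {name: marks[i + 1] - marks[i] for i, name in enumerate(PHASES)}
    durations["total"] = t4 - t0
    return durations


# Hypothetical incident: slow detection, fast rollback.
start = datetime(2024, 1, 1, 12, 0)
example = phase_durations(
    start,
    start + timedelta(minutes=40),  # t1: alert fires
    start + timedelta(minutes=55),  # t2: cause identified
    start + timedelta(minutes=57),  # t3: rollback applied
    start + timedelta(minutes=62),  # t4: service verified healthy
)
dominant = max(PHASES, key=lambda name: example[name])
print(dominant, example["total"])
```

Breaking the total down this way shows which phase to invest in first. In this made-up incident, detection accounts for 40 of the 62 minutes, so a faster rollback mechanism alone would barely move the total.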
Benchmark Bands
| Tier | MTTR |
|---|---|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Between one day and one week |
| Low | More than six months |
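
For dashboards or reports, the bands above can be applied mechanically. This is a sketch only: the function name `mttr_tier` is hypothetical, "six months" is approximated as 182 days (an assumption), and values between one week and six months return `None` because the table does not assign them to a tier.

```python
from datetime import timedelta

# Assumption: "six months" is approximated as 182 days.
SIX_MONTHS = timedelta(days=182)


def mttr_tier(mttr):
    """Map an MTTR value onto the benchmark bands in the table."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr <= timedelta(weeks=1):
        return "Medium"
    if mttr > SIX_MONTHS:
        return "Low"
    # The table defines no band between one week and six months.
    return None


print(mttr_tier(timedelta(minutes=45)))
print(mttr_tier(timedelta(days=3)))
```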
What Makes MTTR Long
- Missing or noisy alerts: if detection depends on a customer noticing and filing a support ticket, the detection phase alone can consume most of the incident's duration.
- Slow or manual rollback: if recovering means SSHing into a server and manually reverting a config file, rollback time dominates. Automated rollback (`kubectl rollout undo`, ArgoCD revision sync) collapses this phase to minutes.
- Unclear runbooks: an on-call engineer who has to figure out from scratch which dashboard to check and which command to run loses time during diagnosis that a clear, tested runbook would have saved.
- No clear rollback owner during the incident: diffused responsibility ("is someone handling this?") delays the rollback decision itself, independent of how fast the rollback mechanism is.
How to Measure MTTR
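
    SELECT AVG(EXTRACT(EPOCH FROM (resolved_at - opened_at)) / 60) AS mttr_minutes
    FROM incidents
    WHERE opened_at > NOW() - INTERVAL '30 days'
      AND caused_by_deployment_id IS NOT NULL;

This query is only as accurate as `opened_at`: if that column records when the ticket was filed rather than when the bad deploy went live, the result understates MTTR by the detection delay.

Use the median alongside the average: a single very long incident (for example, a multi-hour database migration gone wrong) can pull the average MTTR far above what a typical incident actually looks like.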
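
The effect of one outlier on the average is easy to see with a small example. The durations below are made up for illustration; only the standard-library `statistics` module is used.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes; the last one is a long outlier.
durations_minutes = [12, 18, 25, 30, 480]

mttr_mean = mean(durations_minutes)
mttr_median = median(durations_minutes)

# The mean is dragged up by the single 480-minute incident;
# the median still reflects a typical incident.
print(f"mean={mttr_mean:.0f} min, median={mttr_median:.0f} min")
# prints "mean=113 min, median=25 min"
```

If the incident table lives in PostgreSQL, which the query's syntax suggests but the source does not state, the ordered-set aggregate `percentile_cont(0.5) WITHIN GROUP (ORDER BY ...)` can compute the median in the database instead.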
The Relationship Between Rollback Speed and MTTR
The single highest-leverage investment most teams can make to reduce MTTR is removing the human from the rollback decision path entirely for the common case. A CRED-style platform team running Argo Rollouts with automated analysis can detect an error-rate spike and roll back within minutes, fully automatically. Compare that with an on-call engineer who has to be paged, wake up, open a laptop, and manually run the rollback command, a sequence that routinely adds 15 to 30 minutes before the rollback even starts.
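
The core of an automated rollback decision can be sketched in a few lines. This is not Argo Rollouts' API or configuration; it is a plain-Python illustration of the idea, and the function name `should_roll_back`, the 5% threshold, and the three-sample window are all assumptions picked for the example.

```python
def should_roll_back(error_rates, threshold=0.05, consecutive=3):
    """Return True once `consecutive` samples in a row exceed `threshold`.

    Requiring several consecutive bad samples avoids rolling back
    on a single noisy measurement.
    """
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


# Hypothetical error-rate samples taken after a deploy.
one_off_blip = [0.01, 0.02, 0.08, 0.01, 0.02]
sustained_spike = [0.01, 0.09, 0.12, 0.15, 0.14]

print(should_roll_back(one_off_blip))     # False
print(should_roll_back(sustained_spike))  # True
```

A check like this runs on every sample with no paging delay, which is where the minutes are saved compared with waiting for an engineer to reach a laptop.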
**Tip:** Rehearse rollbacks in staging on a schedule (a lightweight form of chaos engineering) rather than just writing a runbook and hoping it works when it matters. A rollback procedure that no one has run before the pressure is real is a leading cause of long diagnosis-and-fix phases.
**Remember:** MTTR is measured from when the incident actually *started* (when the bad deploy went live), not from when someone opened a ticket in the incident tracker. Those two timestamps can be very different if detection is slow.
**Security:** During an active incident, the pressure to restore service fast can lead to bypassing normal access controls (for example, an engineer using a personal admin credential instead of the proper deploy pipeline). Keep a documented, pre-approved "break glass" procedure for emergency access so speed during an incident does not come at the cost of an unaudited access path.
**Common Mistake:** Optimising only the rollback mechanism's speed while leaving detection slow. A rollback that takes ninety seconds does not help if it takes forty minutes for anyone to notice the deploy broke something in the first place. Detection and rollback speed both need investment.
Troubleshooting Reference
| Symptom | Check | What to Look For |
|---|---|---|
| MTTR much higher than expected | Break the incident down into its four phases | Detection or diagnosis phase dominating, not the actual rollback |
| MTTR skewed very high by one incident | `SELECT MAX(resolved_at - opened_at) FROM incidents` | A single long-running incident (often a data issue, not a simple rollback) pulling the average up |
| MTTR not improving despite faster rollback tooling | Time from alert firing to rollback command execution | On-call engineer paged but slow to start the rollback, indicating a runbook or ownership gap |