MTTR (Mean Time to Restore)

MTTR is one of the four DORA metrics. It measures the average time it takes to restore service after a production incident, most commonly one caused by a deployment. It is a stability metric: a low MTTR means that when something does break, the organisation recovers quickly. That matters because some failure rate above zero is unavoidable at scale. What separates elite from low-performing teams is how fast they recover, not whether failures happen at all.

The Four Components of MTTR

+------------------------------------------+
| Incident starts: deploy breaks service   | <- t0
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Detection: alert fires or report arrives | <- t1
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Diagnosis: on-call identifies the cause  | <- t2
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Rollback or fix applied                  | <- t3
+------------------------------------------+
                      |
                      v
+------------------------------------------+
| Verification: service confirmed healthy  | <- t4 = restored
+------------------------------------------+

Detection — the time between the incident actually starting and someone (or some alert) noticing it.
Diagnosis — the time between noticing something is wrong and identifying which deployment or change actually caused it.
Rollback or fix — the time it takes to execute the recovery, whether that is kubectl rollout undo, an ArgoCD sync to a previous revision, or a feature flag flip.
Verification — the time it takes to confirm the service is actually healthy again, not just that the rollback command exited successfully.
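The four phases above can be derived from the five timestamps in the diagram (t0 to t4). The following is a minimal sketch, assuming each incident records all five; the `IncidentTimeline` class, its field names, and the sample timestamps are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the timestamps t0..t4 from the diagram."""
    started: datetime      # t0: bad deploy goes live
    detected: datetime     # t1: alert fires or a report arrives
    diagnosed: datetime    # t2: cause identified
    fix_applied: datetime  # t3: rollback or fix applied
    restored: datetime     # t4: service confirmed healthy

    def phases(self) -> dict[str, timedelta]:
        """Duration of each of the four phases."""
        return {
            "detection": self.detected - self.started,
            "diagnosis": self.diagnosed - self.detected,
            "rollback_or_fix": self.fix_applied - self.diagnosed,
            "verification": self.restored - self.fix_applied,
        }

    def time_to_restore(self) -> timedelta:
        """Total time from incident start (t0) to confirmed recovery (t4)."""
        return self.restored - self.started

    def dominant_phase(self) -> str:
        """Name of the phase that consumed the most time."""
        phases = self.phases()
        return max(phases, key=phases.get)


# Illustrative incident: slow detection, fast rollback.
incident = IncidentTimeline(
    started=datetime(2024, 1, 10, 14, 0),
    detected=datetime(2024, 1, 10, 14, 40),
    diagnosed=datetime(2024, 1, 10, 14, 55),
    fix_applied=datetime(2024, 1, 10, 14, 57),
    restored=datetime(2024, 1, 10, 15, 5),
)

print(incident.time_to_restore())  # 1:05:00
print(incident.dominant_phase())   # detection
```

In this made-up incident the rollback itself takes two minutes, yet the time to restore is 65 minutes because detection alone takes 40.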

Benchmark Bands

Tier	MTTR
Elite	Less than one hour
High	Less than one day
Medium	Between one day and one week
Low	More than six months
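As a rough illustration, the bands above can be turned into a lookup. This is a sketch under stated assumptions: the function name `dora_tier` is hypothetical, "six months" is approximated as 182 days, and values between one week and six months are returned as "Unclassified" because the table defines no band for them.

```python
from datetime import timedelta

# Assumption: "six months" approximated as 182 days.
SIX_MONTHS = timedelta(days=182)


def dora_tier(mttr: timedelta) -> str:
    """Map an MTTR value onto the benchmark bands in the table above."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr <= timedelta(weeks=1):
        return "Medium"
    if mttr > SIX_MONTHS:
        return "Low"
    # The table leaves the range between one week and six months undefined.
    return "Unclassified"


print(dora_tier(timedelta(minutes=45)))  # Elite
print(dora_tier(timedelta(days=3)))      # Medium
```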

What Makes MTTR Long

Missing or noisy alerts — if detection depends on a customer noticing and filing a support ticket, the detection phase alone can consume most of the incident's duration.
Slow or manual rollback — if recovering means SSHing into a server and manually reverting a config file, rollback time dominates. Automated rollback (kubectl rollout undo, ArgoCD revision sync) collapses this phase to minutes.
Unclear runbooks — an on-call engineer who has to figure out from scratch which dashboard to check and which command to run loses time during diagnosis that a clear, tested runbook would have saved.
No clear rollback owner during the incident — diffused responsibility ("is someone handling this?") delays the rollback decision itself, independent of how fast the rollback mechanism is.

How to Measure MTTR

SQL

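-- Mean time to restore, in minutes, for deployment-caused incidents in the last 30 days.
-- Assumes opened_at records when the incident actually started, not when the ticket was filed.
-- AVG skips rows where resolved_at is NULL, so unresolved incidents are excluded.
SELECT AVG(EXTRACT(EPOCH FROM (resolved_at - opened_at)) / 60) AS mttr_minutes
FROM incidents
WHERE opened_at > NOW() - INTERVAL '30 days'
  AND caused_by_deployment_id IS NOT NULL;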

Use the median alongside the average — a single very long incident (a multi-hour database migration gone wrong) can pull the average MTTR far above what a typical incident actually looks like.
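A small worked example shows how far one outlier can move the mean. The durations below are made-up values in minutes, not real incident data.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes: four routine rollbacks
# and one long incident (for example, a migration gone wrong).
durations_minutes = [12, 18, 25, 30, 420]

mean_mttr = mean(durations_minutes)      # 101
median_mttr = median(durations_minutes)  # 25

print(f"mean: {mean_mttr} min, median: {median_mttr} min")
```

Here the mean (101 minutes) is about four times the median (25 minutes), even though four of the five incidents were resolved in half an hour or less.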

The Relationship Between Rollback Speed and MTTR

The single highest-leverage investment most teams can make to reduce MTTR is removing the human from the rollback decision path for the common case. A CRED-style platform team running Argo Rollouts with automated analysis can detect an error-rate spike and roll back within minutes, fully automatically. Compare that with an on-call engineer who has to be paged, wake up, open a laptop, and manually run the rollback command, which routinely adds 15 to 30 minutes before the rollback even starts.
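The kind of rule an automated analysis step might apply can be sketched in a few lines. This is not the Argo Rollouts API or configuration format; the function name `should_roll_back`, the 5% threshold, and the three-sample streak are assumptions chosen for illustration.

```python
def should_roll_back(
    error_rates: list[float],
    threshold: float = 0.05,  # assumed: 5% of requests failing
    consecutive: int = 3,     # assumed: three bad samples in a row
) -> bool:
    """Return True once `consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for rate in error_rates:
        # Extend the streak on a bad sample, reset it on a healthy one.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


# Sustained spike after a deploy: triggers a rollback.
print(should_roll_back([0.01, 0.02, 0.08, 0.09, 0.12]))  # True

# Isolated blips that recover on their own: no rollback.
print(should_roll_back([0.01, 0.08, 0.02, 0.09, 0.01]))  # False
```

Requiring several consecutive bad samples trades a little detection latency for fewer rollbacks triggered by transient noise.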

**Tip:** Rehearse rollbacks in staging on a schedule (a lightweight form of chaos engineering) rather than writing a runbook and hoping it works when it matters. A rollback procedure that no one has run before facing real pressure is a leading cause of long diagnosis-and-fix phases.

**Remember:** MTTR is measured from when the incident actually *started* (when the bad deploy went live), not from when someone opened a ticket in the incident tracker — those two timestamps can be very different if detection is slow.

**Security:** During an active incident, the pressure to restore service fast can lead to bypassing normal access controls (an engineer using a personal admin credential instead of the proper deploy pipeline). Keep a documented, pre-approved "break glass" procedure for emergency access so speed during an incident does not come at the cost of an unaudited access path.

**Common Mistake:** Optimising only the rollback mechanism's speed while leaving detection slow. A rollback that takes ninety seconds does not help if it takes forty minutes for anyone to notice the deploy broke something in the first place — detection and rollback speed both need investment.

Troubleshooting Reference

Symptom	Check	What to Look For
MTTR much higher than expected	Break the incident down into its four phases	Detection or diagnosis phase dominating, not the actual rollback
MTTR skewed very high by one incident	`SELECT MAX(resolved_at - opened_at) FROM incidents`	A single long-running incident (often a data issue, not a simple rollback) pulling the average up
MTTR not improving despite faster rollback tooling	Time from alert firing to rollback command execution	On-call engineer paged but slow to start the rollback, indicating a runbook or ownership gap