What is Change Failure Rate? | DevOps Dictionary

Change Failure Rate

Change Failure Rate (CFR) is one of the four DORA metrics. It measures the percentage of deployments to production that result in a degraded service, a customer-facing incident, a required hotfix, or a rollback. It is a stability metric and the counterbalance to Deployment Frequency: shipping often only counts as elite performance if most of those deployments land safely.

The Calculation

◈ DIAGRAM

+--------------------------+      +----------------------+
| FAILED CHANGES           |      | TOTAL CHANGES        |
|                          |      |                      |
| 12 deploys caused        |      | 80 total deploys     |
| an incident or rollback  |      | in the same window   |
+--------------------------+      +----------------------+

TEXT

Change Failure Rate = (deployments causing a failure / total deployments) x 100

In the example diagram above, 12 failed deploys out of 80 total deploys gives a Change Failure Rate of 15% for that window.
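The same arithmetic can be written as a small helper. This is a minimal sketch: the function name `change_failure_rate` is illustrative, and returning 0% for a window with no deploys is an assumption rather than part of the metric's definition.

```python
def change_failure_rate(failed_deploys: int, total_deploys: int) -> float:
    """Return the Change Failure Rate as a percentage."""
    if failed_deploys < 0 or total_deploys < 0:
        raise ValueError("counts must be non-negative")
    if failed_deploys > total_deploys:
        raise ValueError("failed deploys cannot exceed total deploys")
    if total_deploys == 0:
        # Assumption: an empty window is reported as 0% instead of raising.
        return 0.0
    return 100.0 * failed_deploys / total_deploys


# The example from the diagram: 12 failed deploys out of 80.
print(change_failure_rate(12, 80))  # 15.0
```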

What Counts as a Failure

A deployment counts toward the failure rate if it directly caused any of:

A production incident requiring a declared response.
A hotfix deployed specifically to address something the deploy broke.
A rollback to the previous version.
A significant degradation noticed by monitoring or customers, even if no formal incident was opened.

A deploy that succeeds technically but later needs a follow-up feature change is not a failure by this definition — only deploys that broke something count.
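One way to make this definition explicit is to encode it as a predicate over a deploy's recorded outcome. The sketch below is illustrative only: `DeployOutcome` and its field names are hypothetical and would need to be mapped onto whatever your deployment and incident tooling actually records.

```python
from dataclasses import dataclass


@dataclass
class DeployOutcome:
    """Hypothetical record of what happened after a single deploy."""
    declared_incident: bool = False
    hotfix_required: bool = False
    rolled_back: bool = False
    significant_degradation: bool = False
    # A later feature follow-up is tracked here but never counts as a failure.
    follow_up_feature_change: bool = False


def is_failed_change(outcome: DeployOutcome) -> bool:
    """True if the deploy broke something under the definition above."""
    return any((
        outcome.declared_incident,
        outcome.hotfix_required,
        outcome.rolled_back,
        outcome.significant_degradation,
    ))


print(is_failed_change(DeployOutcome(rolled_back=True)))               # True
print(is_failed_change(DeployOutcome(follow_up_feature_change=True)))  # False
```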

Benchmark Bands

Tier	Change Failure Rate
Elite	0% to 15%
High	16% to 30%
Medium / Low	Above 30%, with growing reliance on manual remediation
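For dashboards, the bands above can be turned into a lookup. This is a sketch under one stated assumption: the table leaves values between 15% and 16% unassigned, so the hypothetical `cfr_tier` function below treats anything above 15% and up to 30% as High.

```python
def cfr_tier(cfr_pct: float) -> str:
    """Map a Change Failure Rate percentage onto the bands in the table above."""
    if not 0 <= cfr_pct <= 100:
        raise ValueError("cfr_pct must be between 0 and 100")
    if cfr_pct <= 15:
        return "Elite"
    if cfr_pct <= 30:
        # Assumption: fractional values such as 15.5 fall into High.
        return "High"
    return "Medium / Low"


print(cfr_tier(15.0))  # Elite
print(cfr_tier(22.5))  # High
print(cfr_tier(45.0))  # Medium / Low
```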

Measuring Change Failure Rate

The hard part of this metric is not the arithmetic — it is reliably linking each incident back to the deployment that caused it. The cleanest approach is to stamp a deployment ID into the running service (an environment variable, a /version endpoint) so that whoever opens an incident can record exactly which deploy was live when it started.
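As a rough illustration of stamping a deployment ID into a running service, the sketch below reads the ID from an environment variable and exposes it on a `/version` endpoint using only the Python standard library. The variable name `DEPLOYMENT_ID`, the JSON shape, and the port are assumptions made for the example, not an established convention.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Mapping


def version_payload(environ: Mapping[str, str] = os.environ) -> dict:
    """Build the /version response body from the process environment."""
    # Assumption: the deploy pipeline sets DEPLOYMENT_ID at release time.
    return {"deployment_id": environ.get("DEPLOYMENT_ID", "unknown")}


class VersionHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/version":
            self.send_error(404)
            return
        body = json.dumps(version_payload()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start a blocking HTTP server that answers GET /version."""
    HTTPServer(("127.0.0.1", port), VersionHandler).serve_forever()
```

Calling `serve()` starts the server. Whoever opens an incident can then query `/version` and copy the returned `deployment_id` into the incident record.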

SQL

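-- Share of deployments in the last 30 days linked to at least one incident.
SELECT
  -- Count failed deployments, not incidents: one deploy may cause several incidents.
  COUNT(DISTINCT i.caused_by_deployment_id) * 100.0
    / NULLIF(COUNT(DISTINCT d.id), 0) AS change_failure_rate_pct
FROM deployments d
LEFT JOIN incidents i ON i.caused_by_deployment_id = d.id
WHERE d.created_at > NOW() - INTERVAL '30 days';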

Common Causes of High Change Failure Rate

Insufficient automated test coverage — bugs that should have been caught pre-merge instead surface in production.
Staging environment fidelity gaps — staging running different data volumes, different config, or a different version of a downstream dependency than production, so tests pass in staging and fail in prod.
Skipped or advisory-only scan gates — a SAST or dependency scan stage configured to warn instead of block.
Manual deployment steps — any step a human runs by hand (a manual database migration, a manual config toggle) is a step that occasionally gets done wrong or forgotten.

Strategies to Reduce CFR

A team at PhonePe driving CFR down typically invests in three places at once: closing the gap between staging and production environment parity, raising automated test coverage on the code paths most often implicated in past incidents, and making canary or blue-green deployments the default rather than the exception so a bad change is caught on a small slice of traffic before it reaches everyone.

**Tip:** Track Change Failure Rate per service, not just org-wide. One chronically unstable service can hide behind a healthy org-wide average while quietly causing most of the customer-facing pain.
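A per-service breakdown might look like the following sketch. The input shape (pairs of service name and a failed flag) and the sample data are hypothetical, chosen to show how one unstable service can hide behind a healthy overall figure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cfr_by_service(deploys: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute CFR (as a percentage) per service from (service, failed) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for service, failed in deploys:
        totals[service] += 1
        if failed:
            failures[service] += 1
    return {name: 100.0 * failures[name] / totals[name] for name in totals}


# Hypothetical window: "checkout" is stable, "search" fails half the time.
deploys = (
    [("checkout", False)] * 16
    + [("search", True)] * 2
    + [("search", False)] * 2
)
overall = 100.0 * sum(failed for _, failed in deploys) / len(deploys)

print(overall)                  # 10.0
print(cfr_by_service(deploys))  # {'checkout': 0.0, 'search': 50.0}
```

In this made-up window the org-wide figure is 10%, while the per-service view shows `search` at 50%.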

**Remember:** A near-zero Change Failure Rate is not automatically good. If it is achieved by deploying extremely rarely after exhaustive manual QA, you have traded speed for stability, which usually shows up as poor Deployment Frequency and Lead Time instead.

**Security:** Be cautious about who can edit the `caused_by_deployment_id` linkage on an incident. If this metric is tied to team evaluations, there is a structural incentive to under-link incidents to deployments — keep the linkage automated at incident creation time wherever possible.
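One way to automate the linkage is to fill in `caused_by_deployment_id` when the incident record is created, using whichever deployment was live at the incident's start time. The sketch below assumes deployments are available as `(deployment_id, deployed_at)` pairs. The function names and record shape are hypothetical, and "live at start" is only a default that a later review may need to correct.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Deployment = Tuple[str, datetime]  # (deployment_id, deployed_at)


def deployment_live_at(deployments: List[Deployment],
                       started_at: datetime) -> Optional[str]:
    """Return the ID of the most recent deployment at or before started_at."""
    candidates = [d for d in deployments if d[1] <= started_at]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d[1])[0]


def open_incident(title: str, started_at: datetime,
                  deployments: List[Deployment]) -> dict:
    """Create an incident record with the deployment linkage set automatically."""
    return {
        "title": title,
        "started_at": started_at,
        # Set by the system at creation time instead of being typed in by hand.
        "caused_by_deployment_id": deployment_live_at(deployments, started_at),
    }


history = [
    ("d1", datetime(2024, 1, 1, 9, 0)),
    ("d2", datetime(2024, 1, 1, 12, 0)),
    ("d3", datetime(2024, 1, 1, 15, 0)),
]
incident = open_incident("Checkout errors", datetime(2024, 1, 1, 13, 0), history)
print(incident["caused_by_deployment_id"])  # d2
```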

**Common Mistake:** Counting only incidents that paged on-call as failures, while ignoring deploys that silently degraded a non-critical path with no alert firing. This systematically undercounts the real failure rate and hides exactly the kind of slow-burning regression that canary analysis is designed to catch.

Troubleshooting Reference

Symptom	Check	What to Look For
CFR reads as zero	Query the incident-to-deployment join	`caused_by_deployment_id` never populated on real incidents
CFR spikes after a specific service's deploys	Break CFR down per service	One unstable service skewing the org-wide average
CFR looks good but customers report more issues	Compare CFR to support ticket volume over the same window	Incidents not formally opened for smaller, unpaged degradations