Change Failure Rate
Change Failure Rate (CFR) is one of the four DORA metrics. It measures the percentage of deployments to production that result in a degraded service, a customer-facing incident, a required hotfix, or a rollback. It is a stability metric and the counterbalance to Deployment Frequency: shipping often counts as elite performance only if most of those deploys land safely.
The Calculation
    +--------------------------+   +--------------------------+
    | FAILED CHANGES           |   | TOTAL CHANGES            |
    |                          |   |                          |
    | 12 deploys caused        |   | 80 total deploys         |
    | an incident or rollback  |   | in the same window       |
    +--------------------------+   +--------------------------+

    Change Failure Rate = (deployments causing a failure / total deployments) x 100

In the example diagram above, 12 failed deploys out of 80 total deploys gives a Change Failure Rate of 15% for that window.
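The formula can be written as a small function. This is a minimal sketch: the function name and the choice to return `None` for a window with no deployments are assumptions for illustration, not part of the metric's definition.

```python
from typing import Optional


def change_failure_rate(failed_deploys: int, total_deploys: int) -> Optional[float]:
    """Return the Change Failure Rate as a percentage for one time window.

    Returns None when there were no deployments, because the rate is
    undefined in that case (an assumption made for this sketch).
    """
    if failed_deploys < 0 or total_deploys < 0:
        raise ValueError("counts must be non-negative")
    if failed_deploys > total_deploys:
        raise ValueError("failed deploys cannot exceed total deploys")
    if total_deploys == 0:
        return None
    # Multiply before dividing to keep the arithmetic exact for whole numbers.
    return failed_deploys * 100 / total_deploys


print(change_failure_rate(12, 80))  # 15.0
```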
What Counts as a Failure
A deployment counts toward the failure rate if it directly caused any of:
- A production incident requiring a declared response.
- A hotfix deployed specifically to address something the deploy broke.
- A rollback to the previous version.
- A significant degradation noticed by monitoring or customers, even if no formal incident was opened.
A deploy that succeeds technically but later needs a follow-up feature change is not a failure by this definition; only deploys that broke something count.
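The criteria above can be encoded as a single predicate so that every deploy is classified the same way. The record shape and field names below are hypothetical, chosen only to mirror the four bullet points; a real system would derive them from incident and deployment tooling.

```python
from dataclasses import dataclass


@dataclass
class DeployOutcome:
    """What happened after one production deploy (hypothetical record shape)."""
    declared_incident: bool = False         # production incident with a declared response
    hotfix_required: bool = False           # hotfix shipped to repair what the deploy broke
    rolled_back: bool = False               # reverted to the previous version
    significant_degradation: bool = False   # seen by monitoring or customers, no formal incident
    follow_up_feature_change: bool = False  # later feature work; does not count as a failure


def is_failed_change(outcome: DeployOutcome) -> bool:
    """Return True if the deploy counts toward the Change Failure Rate."""
    return (
        outcome.declared_incident
        or outcome.hotfix_required
        or outcome.rolled_back
        or outcome.significant_degradation
    )
```

A deploy with only `follow_up_feature_change=True` is deliberately not counted, matching the definition above.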
Benchmark Bands
| Tier | Change Failure Rate |
|---|---|
| Elite | 0% to 15% |
| High | 16% to 30% |
| Medium / Low | Higher, with growing reliance on manual remediation |
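As a rough illustration, the bands in the table can be mapped to a lookup function. The table leaves values between 15% and 16% unassigned; this sketch assumes anything above 15% and up to 30% falls into the High band, which is an interpretation rather than something the table states.

```python
def benchmark_tier(cfr_pct: float) -> str:
    """Map a Change Failure Rate percentage to the tier names in the table."""
    if not 0 <= cfr_pct <= 100:
        raise ValueError("cfr_pct must be between 0 and 100")
    if cfr_pct <= 15:
        return "Elite"
    if cfr_pct <= 30:
        # Assumption: fractional values between 15 and 16 are treated as High.
        return "High"
    return "Medium / Low"
```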
Measuring Change Failure Rate
The hard part of this metric is not the arithmetic; it is reliably
linking each incident back to the deployment that caused it. The cleanest
approach is to stamp a deployment ID into the running service (an
environment variable, a /version endpoint) so that whoever opens an
incident can record exactly which deploy was live when it started.
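One way the stamping could look in practice is sketched below. It assumes the deploy pipeline sets an environment variable named `DEPLOYMENT_ID` (a hypothetical name) and that incidents are plain dictionaries; `version_payload` and `open_incident` are illustrative helpers, not part of any particular framework or incident tool.

```python
import os
from datetime import datetime, timezone
from typing import Mapping, Optional

DEPLOYMENT_ID_ENV_VAR = "DEPLOYMENT_ID"  # assumed variable name set by the deploy pipeline


def current_deployment_id(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the deployment ID stamped into this process, or None if absent."""
    return environ.get(DEPLOYMENT_ID_ENV_VAR)


def version_payload(environ: Mapping[str, str] = os.environ) -> dict:
    """Body that a /version endpoint could return."""
    return {"deployment_id": current_deployment_id(environ)}


def open_incident(title: str, environ: Mapping[str, str] = os.environ) -> dict:
    """Create an incident record that captures the deploy live at creation time."""
    return {
        "title": title,
        "opened_at": datetime.now(timezone.utc).isoformat(),
        # Filled in automatically so nobody has to look it up by hand.
        "caused_by_deployment_id": current_deployment_id(environ),
    }
```

Recording the ID at incident creation, rather than asking a responder to fill it in later, keeps the link from depending on memory or manual lookup.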
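    -- Share of deployments in the last 30 days linked to at least one incident.
    -- Counting distinct deployment IDs (not incident IDs) keeps a deploy that
    -- caused several incidents from being counted more than once.
    SELECT
      COUNT(DISTINCT i.caused_by_deployment_id) * 100.0
        / NULLIF(COUNT(DISTINCT d.id), 0) AS change_failure_rate_pct
    FROM deployments d
    LEFT JOIN incidents i ON i.caused_by_deployment_id = d.id
    WHERE d.created_at > NOW() - INTERVAL '30 days';

Common Causes of High Change Failure Rate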
- Insufficient automated test coverage: bugs that should have been caught pre-merge instead surface in production.
- Staging environment fidelity gaps: staging runs different data volumes, different config, or a different version of a downstream dependency than production, so tests pass in staging and fail in prod.
- Skipped or advisory-only scan gates: a SAST or dependency scan stage configured to warn instead of block.
- Manual deployment steps: any step a human runs by hand (a manual database migration, a manual config toggle) is a step that occasionally gets done wrong or forgotten.
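The difference between an advisory and a blocking scan gate usually comes down to the exit code the stage returns to the pipeline. The sketch below is a generic illustration; `gate_exit_code` and its parameters are hypothetical and do not correspond to any specific scanner's interface.

```python
def gate_exit_code(finding_count: int, blocking: bool = True) -> int:
    """Return the exit code a scan stage would hand back to the pipeline.

    A non-zero code fails the stage; zero lets the pipeline continue.
    """
    if finding_count < 0:
        raise ValueError("finding_count must be non-negative")
    if finding_count > 0 and blocking:
        return 1  # blocking gate: findings stop the deploy
    return 0      # clean scan, or advisory mode where findings only warn
```

In advisory mode the stage passes even with findings, which is how known issues can reach production unnoticed.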
Strategies to Reduce CFR
A team at PhonePe driving CFR down typically invests in three places at once: closing the gap between staging and production environment parity, raising automated test coverage on the code paths most often implicated in past incidents, and making canary or blue-green deployments the default rather than the exception so a bad change is caught on a small slice of traffic before it reaches everyone.
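To make the canary idea concrete, here is a minimal sketch of a promote-or-rollback decision based on error rates. All thresholds and the function name are assumptions for illustration; real canary analysis typically looks at more signals (latency, saturation, business metrics) than error rate alone.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 2.0,        # assumed: canary may run at up to 2x the baseline error rate
    min_error_rate: float = 0.01,  # assumed: floor so a near-zero baseline does not trip on noise
    min_requests: int = 100,       # assumed: minimum canary sample before deciding
) -> str:
    """Return "wait", "promote", or "rollback" for a canary deployment."""
    if canary_requests < min_requests:
        return "wait"  # not enough traffic on the canary slice yet
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    threshold = max(baseline_rate * max_ratio, min_error_rate)
    return "rollback" if canary_rate > threshold else "promote"
```

A rollback triggered at the canary stage still counts as a failed change, but it reaches only a small slice of users instead of everyone.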
**Tip:** Track Change Failure Rate per service, not just org-wide. One chronically unstable service can hide behind a healthy org-wide average while quietly causing most of the customer-facing pain.
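A per-service breakdown might be sketched as follows. The input shape, a list of `(service, failed)` pairs, and the service names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cfr_by_service(deploys: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the Change Failure Rate percentage for each service."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for service, failed in deploys:
        totals[service] += 1
        if failed:
            failures[service] += 1
    return {service: failures[service] * 100 / totals[service] for service in totals}


# Hypothetical window: 50 deploys across two services, 6 failures in total.
deploys = (
    [("search", False)] * 39 + [("search", True)] * 1
    + [("checkout", False)] * 5 + [("checkout", True)] * 5
)
rates = cfr_by_service(deploys)
org_wide = sum(1 for _, failed in deploys if failed) * 100 / len(deploys)
```

With this made-up data the org-wide rate is 12%, while `rates` shows `search` at 2.5% and `checkout` at 50%: the unstable service is invisible in the average.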
**Remember:** A near-zero Change Failure Rate is not automatically good. If it is achieved by deploying extremely rarely after exhaustive manual QA, you have traded speed for stability, and the cost usually shows up as poor Deployment Frequency and Lead Time instead.
**Security:** Be cautious about who can edit the `caused_by_deployment_id` linkage on an incident. If this metric is tied to team evaluations, there is a structural incentive to under-link incidents to deployments, so keep the linkage automated at incident creation time wherever possible.
**Common Mistake:** Counting only incidents that paged on-call as failures, while ignoring deploys that silently degraded a non-critical path with no alert firing. This systematically undercounts the real failure rate and hides exactly the kind of slow-burning regression that canary analysis is designed to catch.
Troubleshooting Reference
| Symptom | Check | What to Look For |
|---|---|---|
| CFR reads as zero | Query the incident-to-deployment join | `caused_by_deployment_id` never populated on real incidents |
| CFR spikes after a specific service's deploys | Break CFR down per service | One unstable service skewing the org-wide average |
| CFR looks good but customers report more issues | Compare CFR to support ticket volume over the same window | Incidents not formally opened for smaller, unpaged degradations |
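For the first row of the table, a quick check is to measure how many incidents carry a deployment link at all. The sketch below assumes incidents are available as dictionaries with an optional `caused_by_deployment_id` key; the function name is hypothetical.

```python
from typing import Iterable, Mapping, Optional


def linkage_coverage(incidents: Iterable[Mapping]) -> Optional[float]:
    """Return the percentage of incidents with caused_by_deployment_id populated.

    Returns None when there are no incidents to inspect.
    """
    incidents = list(incidents)
    if not incidents:
        return None
    linked = sum(
        1 for incident in incidents
        if incident.get("caused_by_deployment_id") is not None
    )
    return linked * 100 / len(incidents)
```

A coverage figure near zero suggests that a reported CFR of zero reflects missing links rather than stable deploys.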