AI coding agents are shipping code faster than ever — but the 2025 DORA report shows incidents per pull request are rising sharply. Here is what that means for your on-call rotation.
Something strange is happening on engineering teams in 2026. Developers are more productive than ever. Sprint velocity is up. Feature delivery is faster. And production incidents are also up — not because teams are less careful, but because the ratio of "code being written" to "humans reviewing it" has fundamentally shifted.
AI coding agents — Cursor, GitHub Copilot, Devin, and their successors — have made it genuinely fast to write code that looks correct, passes lint and tests, and lands in a PR that nobody has the bandwidth to review carefully. The DORA 2025 State of DevOps report found that teams with high AI coding agent adoption saw incidents per pull request increase significantly compared to teams with lower adoption.
The code is moving faster. The incident response capability hasn't moved at the same speed.
The DORA 2025 findings are specific and worth understanding carefully. High-performing teams — those with the most AI-assisted coding — saw an increase in deployment frequency. This part is expected.
What was less expected: their change failure rate did not improve proportionally. Some teams saw it increase. The report's analysis points to a structural reason: AI-generated code tends to solve the immediate problem correctly but miss edge cases, security implications, and interactions with adjacent systems that an experienced human reviewer would catch.
The mental model shifts: a senior engineer manually writing code is also automatically reviewing it. That review happens in their head as they type. When an AI generates the code and the engineer accepts the suggestion, that mental review step is abbreviated or skipped.
This is not an argument against AI coding tools. It is an argument for updating your incident response and review infrastructure to match your new delivery velocity.
Understanding the specific gaps helps you close them systematically.
The review gap: More code per reviewer. If a team was merging 30 PRs per week and is now merging 80, but the review team size is the same, each PR gets less scrutiny. The fix is not "review AI code differently" — it is enforcing automated gates that catch what hurried human review misses.
The context gap: AI-generated code is often syntactically and semantically correct in isolation but missing system-level context. A generated function that correctly implements a payment retry mechanism might not know that your retry logic must respect idempotency keys from your upstream vendor's API contract. The AI doesn't know what it doesn't know.
The blast radius gap: Faster shipping means more changes deployed more frequently. Even if each individual change has the same failure probability, more changes per day means more incidents per week. The math is straightforward — and the solution is not to slow down, but to reduce the blast radius of each change.
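The arithmetic can be made concrete with a short sketch. The 30 and 80 changes per week come from the review-gap example above. The 5% per-change failure probability is an assumed figure chosen only for illustration, and each merged PR is treated as one deployed change.

```python
def expected_incidents(changes_per_week: int, failure_probability: float) -> float:
    """Expected incidents per week if every change fails independently
    with the same probability."""
    return changes_per_week * failure_probability


# Assumed per-change failure probability (illustrative, not a DORA figure).
FAILURE_PROBABILITY = 0.05

before = expected_incidents(30, FAILURE_PROBABILITY)
after = expected_incidents(80, FAILURE_PROBABILITY)

print(f"30 changes/week -> {before:.1f} expected incidents")
print(f"80 changes/week -> {after:.1f} expected incidents")
```

Under these assumptions the per-change quality is identical in both cases; the incident load grows purely because more changes ship.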
The goal is to automate the review steps that are formulaic so humans focus on the review steps that require judgment.
Static analysis catches entire classes of bugs that appear in AI-generated code at higher rates than human-written code: missing error handling, insecure default configurations, hardcoded values that should be environment variables.
```yaml
# GitHub Actions: mandatory security and quality gate
name: AI Code Quality Gate
on:
  pull_request:
    branches: [main, staging]

jobs:
  automated-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # need full history for diff analysis

      - name: Semgrep Security Scan
        uses: semgrep/semgrep-action@v1
        with:
          config: >-
            p/owasp-top-ten
            p/secrets
            p/nodejs

      - name: Dependency Vulnerability Check
        run: npm audit --audit-level=high

      - name: Test Coverage Check
        run: |
          npm test -- --coverage
          # fail if coverage drops below 80%
          npx coverage-check coverage/coverage-summary.json 80

      - name: Architecture Boundary Check
        run: |
          # ensure payment service doesn't import from user service
          npx depcruise --validate .dependency-cruiser.js src
```

The architecture boundary check is particularly valuable for AI-generated code: it catches the "I'll just import this utility from another service" shortcuts that create hidden coupling.
The context gap is a prompting and process problem, not a tooling problem.
Teams that get better results from AI coding agents are intentional about providing context. Instead of "write a function to retry failed payments," they provide:
```text
Write a function to retry failed payments.

Context:
- Payment vendor is Razorpay, API docs: [link]
- Idempotency key must be sent as header X-Razorpay-Idempotent-Key
- Max 3 retries with 2s, 4s, 8s backoff
- Do NOT retry on status 422 (validation errors) or 401 (auth errors)
- Retry only on 500, 502, 503, 504
- Log each retry attempt with trace_id and vendor_response_code
```

That context gets you code that is actually correct for your system. Without it, you get code that looks correct in isolation.
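As an illustration of what the fuller prompt constrains, here is a minimal Python sketch of the retry policy it describes. The `send_payment` callable is hypothetical and stands in for the real vendor call; the header name is copied from the prompt and has not been checked against the vendor's documentation. The `sleep` parameter is injectable so the backoff can be tested without waiting.

```python
import logging
import time
from typing import Callable, Dict

logger = logging.getLogger("payment_retry")

RETRYABLE_STATUSES = {500, 502, 503, 504}  # retry only on these
BACKOFF_SECONDS = (2, 4, 8)                # at most 3 retries


def pay_with_retry(
    send_payment: Callable[[Dict[str, str]], int],
    idempotency_key: str,
    trace_id: str,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send_payment and retry according to the policy in the prompt.

    send_payment is a hypothetical callable that takes request headers
    and returns the vendor's HTTP status code.
    """
    # The same key is sent on every attempt so the vendor can deduplicate.
    headers = {"X-Razorpay-Idempotent-Key": idempotency_key}

    status = send_payment(headers)
    for attempt, delay in enumerate(BACKOFF_SECONDS, start=1):
        if status not in RETRYABLE_STATUSES:
            break  # success, or a non-retryable error such as 401 or 422
        logger.warning(
            "payment retry attempt=%d trace_id=%s vendor_response_code=%d",
            attempt, trace_id, status,
        )
        sleep(delay)
        status = send_payment(headers)
    return status
```

Each bullet in the prompt maps to a visible line of code here, which is also what makes the result reviewable: a reviewer can check the status set and backoff tuple against the vendor contract instead of inferring intent.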
Add a mandatory AI Context section to your PR template:
```markdown
## AI Context (if AI-assisted)
- [ ] AI tool used: [Cursor / Copilot / other]
- [ ] Context provided: [brief description]
- [ ] Parts I reviewed manually: [what you checked]
- [ ] Parts that need human review: [what you are less sure about]
```

This forces engineers to think explicitly about what the AI does and doesn't know, and flags areas where reviewers should focus attention.
The most effective structural response to higher deployment frequency is mandatory progressive delivery. If every deployment goes to 5% of traffic first and auto-rolls back on metric degradation, the blast radius of any individual AI-generated bug is bounded.
This is the technical fix that matches the scale of the problem. Manual review of 80 PRs per week is not sustainable. Automated canary rollouts that catch regressions in production before they reach 100% of traffic scale with your delivery velocity.
```yaml
# Argo Rollouts canary for any service with AI-assisted delivery
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause:
            duration: 5m
        - analysis:
            templates:
              - templateName: ai-code-safety-check
        - setWeight: 25
        - pause:
            duration: 10m
        - setWeight: 100
```

The ai-code-safety-check analysis template checks not just infrastructure metrics but business metrics: did order completion rate change, did payment success rate change, did the P99 latency of downstream services change. Business metrics catch context-gap failures that infrastructure metrics miss.
Higher deployment frequency combined with a higher change failure rate means your on-call rotation needs to handle more incidents. This is not a call to expand your on-call team; it is a call to make each incident faster to resolve.
Three changes that matter most:
Pre-written runbooks for AI-generated code patterns: The most common failure modes from AI-generated code are predictable — missing null checks, incorrect retry semantics, wrong error code handling. Write runbooks for these patterns before they cause incidents.
Deployment-correlated alerting: Every deployment event should automatically be visible in your incident management dashboard. When an alert fires within 30 minutes of a deployment, the first diagnostic hypothesis should always be "is this deployment-related?" AI code moves faster, so the correlation window should be shorter, not longer.
Automatic rollback triggers: If an alert fires and a deployment happened in the last 30 minutes, rollback should be the default first response, not root cause analysis. Restore service first, investigate second. This requires runbooks that explicitly say "rollback is always safe if the change is less than 30 minutes old," and release practices, such as backward-compatible schema changes, that make that statement true.
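The correlation rule behind the last two points can be sketched in a few lines of Python. This is an illustration under assumptions, not an integration with any particular incident tool: the function names are hypothetical, deployment times are assumed to be available as a plain list, and the 30-minute window is the one used in the text.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

CORRELATION_WINDOW = timedelta(minutes=30)


def correlated_deployment(
    alert_time: datetime,
    deployment_times: Iterable[datetime],
    window: timedelta = CORRELATION_WINDOW,
) -> Optional[datetime]:
    """Return the most recent deployment that happened within `window`
    before the alert, or None if there is none."""
    candidates = [
        deployed_at
        for deployed_at in deployment_times
        if timedelta(0) <= alert_time - deployed_at <= window
    ]
    return max(candidates, default=None)


def first_response(alert_time: datetime, deployment_times: Iterable[datetime]) -> str:
    """Suggest the default first action for an alert."""
    if correlated_deployment(alert_time, deployment_times) is not None:
        return "rollback"
    return "investigate"
```

Keeping the window as a single named constant makes it easy to tighten later without hunting through alert rules.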
A team that has correctly adapted its incident response to AI-assisted development velocity looks like this:
* Every PR has automated security, quality, and architecture gates that cannot be bypassed.
* Engineers provide structured context when using AI coding tools.
* Every deployment goes through a canary rollout with automated metric analysis.
* On-call engineers have pre-written runbooks for common AI-generated failure patterns.
* Rollback is the default first response to any deployment-correlated incident.
The teams struggling are the ones that adopted AI coding tools at the speed of "install the plugin" and left their review, deployment, and incident response infrastructure unchanged. The code velocity increased. The safety infrastructure did not.
Measure incidents per deployment, not just incidents per week. This is the metric that reveals the impact of AI-assisted coding on incident rates: raw incident count climbs with deployment frequency even when the change failure rate is flat, so only the per-deployment figure shows whether each change has become riskier.
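A small sketch of the normalization, using hypothetical weekly figures invented for illustration. The `Week` record and its field names are assumptions, not the schema of any specific tool.

```python
from typing import List, NamedTuple


class Week(NamedTuple):
    label: str
    deployments: int
    incidents: int


def incidents_per_deployment(week: Week) -> float:
    """Incidents normalized by deployment count (0.0 when nothing shipped)."""
    if week.deployments == 0:
        return 0.0
    return week.incidents / week.deployments


# Hypothetical figures, for illustration only.
weeks: List[Week] = [
    Week("baseline", deployments=30, incidents=3),
    Week("more deploys, same quality", deployments=80, incidents=8),
    Week("more deploys, lower quality", deployments=80, incidents=12),
]

for week in weeks:
    rate = incidents_per_deployment(week)
    print(f"{week.label}: {week.incidents} incidents, {rate:.3f} per deployment")
```

The second and third rows both show more incidents than the baseline, but only the third shows a higher rate per deployment, which is the signal that per-change quality has actually dropped.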
Add AI tool usage to your incident postmortem template. "Was AI assistance used in the code that caused this incident?" is a question that generates data about where your context gap is largest. After 20 postmortems, patterns emerge: specific types of changes, specific AI tools, or specific domains where the context gap is most costly.
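Once the question is in the template, the tallying itself is simple. The sketch below assumes postmortems are exported as records with `ai_assisted`, `tool`, and `domain` fields; those field names and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical postmortem records; field names are assumptions.
postmortems: List[Dict[str, Any]] = [
    {"id": "PM-101", "ai_assisted": True, "tool": "Cursor", "domain": "payments"},
    {"id": "PM-102", "ai_assisted": False, "tool": None, "domain": "search"},
    {"id": "PM-103", "ai_assisted": True, "tool": "Copilot", "domain": "payments"},
    {"id": "PM-104", "ai_assisted": True, "tool": "Cursor", "domain": "payments"},
    {"id": "PM-105", "ai_assisted": True, "tool": "Copilot", "domain": "auth"},
]


def hotspots(records: List[Dict[str, Any]], field: str) -> Counter:
    """Count AI-assisted incidents grouped by the given field."""
    return Counter(r[field] for r in records if r["ai_assisted"])


print(hotspots(postmortems, "domain").most_common())
print(hotspots(postmortems, "tool").most_common())
```

With only a handful of records the counts are anecdotal; the grouping becomes useful once enough postmortems have accumulated.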
Don't ban AI coding tools. Teams that try to respond to rising incident rates by restricting AI tool usage see developer productivity drop without seeing meaningful incident rate improvement — because the root cause was insufficient automated gates, not the AI tool itself.
| Response to AI-Driven Incidents | Effort | Effectiveness |
|---|---|---|
| Restrict AI tool usage | Low | Very low |
| Increase manual review | High | Medium |
| Automate quality gates | Medium | High |
| Canary all deployments | Medium | Very high |
| AI-assisted incident response | High | Very high |
The highest-leverage combination: automated quality gates in CI plus canary deployments. These two changes address the review gap and the blast radius gap without asking humans to manually review 80 PRs per week.
**References & Further Reading**

* [DORA State of DevOps 2025](https://dora.dev/research/) - Source data on AI coding and incident rates
* [Argo Rollouts](https://argoproj.github.io/argo-rollouts/) - Canary deployment implementation
* [Semgrep Rules](https://semgrep.dev/r) - Open-source SAST rules for automated review
* [GitHub Copilot Research](https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/) - Productivity data on AI coding tools
Define a custom AnalysisTemplate that queries business metrics — order completion rate, payment success rate, or checkout funnel conversion — from Prometheus using instrumented application counters. Set the successCondition to compare the canary metric against the stable baseline using a percentage threshold rather than an absolute value, so the rollback logic scales with traffic volume rather than failing on quiet periods.
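The comparison logic itself is independent of Argo Rollouts and can be sketched in Python. In a real setup this would live in the AnalysisTemplate's success condition over Prometheus query results; the function names, the 5% default threshold, and the verdict strings below are assumptions for illustration.

```python
from typing import Optional


def success_rate(successes: int, total: int) -> Optional[float]:
    """Fraction of successful events, or None when there was no traffic."""
    return successes / total if total > 0 else None


def canary_verdict(
    canary_rate: Optional[float],
    baseline_rate: Optional[float],
    max_relative_drop: float = 0.05,
) -> str:
    """Compare the canary against the stable baseline.

    The threshold is relative to the baseline, so the same rule applies
    at high and low traffic. Missing data yields "inconclusive" instead
    of a failure, so quiet periods do not trigger a rollback by themselves.
    """
    if canary_rate is None or baseline_rate is None or baseline_rate == 0:
        return "inconclusive"
    relative_drop = (baseline_rate - canary_rate) / baseline_rate
    return "pass" if relative_drop <= max_relative_drop else "fail"


# Example with hypothetical payment counts.
baseline = success_rate(successes=960, total=1000)
canary = success_rate(successes=47, total=50)
print(canary_verdict(canary, baseline))
```

A relative threshold means a canary at 94% passes against a 96% baseline, while the same 94% would fail against a 99.9% baseline, which is usually the behavior you want from a business metric.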
Use dependency-cruiser with a .dependency-cruiser.js ruleset that defines forbidden import paths between service modules. Run it as a mandatory CI step that exits with a non-zero code when violations are detected. For monorepos, pair with Nx affected graph analysis to only validate modules touched by the PR rather than scanning the entire codebase on every commit.