Overview and What You Will Learn
Drift happens when an engineer logs into the AWS console at 2am during an incident and manually bumps a security group rule, or widens an RDS instance's storage, just to make the immediate problem go away. The fix works. The incident closes. But now your Terraform configuration still describes the OLD rule, and your Terraform state still believes the OLD rule is what is deployed. Reality moved. Terraform's understanding of reality did not. That gap is drift — and left unmanaged, it quietly grows until nobody trusts terraform plan output anymore because it is always full of unexplained changes nobody asked for.
By the end of this topic you will be able to:
- Explain exactly how and why drift happens, with real incident-response scenarios
- Use terraform plan as your first and fastest drift detection tool
- Understand what -refresh-only does and how it differs from a full apply
- Set up scheduled, automated drift detection in CI
- Use Driftctl as a dedicated, more comprehensive drift detection tool
- Decide, case by case, whether to reconcile drift back to your config or update your config to match reality
- Prevent drift at the source with IAM policies that block console changes on Terraform-managed resources
- Follow a clear incident response procedure the next time drift happens in production
Why This Matters in Production
At Hotstar, during a high-traffic cricket match, an on-call engineer noticed an ALB target group was returning unhealthy targets and manually changed the health check path directly in the AWS console to stop the bleeding — a completely reasonable, correct call under incident pressure. Three weeks later, a routine terraform plan on an unrelated PR showed an unexpected change to that same health check path, with no explanation in the diff. The reviewer almost approved it as probably fine before someone traced it back to the 2am incident fix that nobody had ever told Terraform about. Without that trace, the PR would have silently reverted a deliberate, still-needed fix back to the old broken health check — right before the next big match.
Drift is not just a technical problem. It is a trust problem. When your team stops believing terraform plan output, you have lost the most important safety net in your infrastructure workflow.
Core Principles
The Drift Lifecycle

    +------------------------------------------+
    | Manual change made OUTSIDE Terraform     |  <- console, CLI, or another tool
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | Real infra now differs from state file   |  <- Terraform does not know yet
    +------------------------------------------+
                         |
                         v
    +------------------------------------------+
    | terraform plan reads real infrastructure |  <- drift is detected here
    | and shows an unexpected diff             |
    +------------------------------------------+
                         |
             +-----------+------------+
             |                        |
             v                        v
    +------------------+   +----------------------+
    | RECONCILE        |   | UPDATE CONFIG        |
    | terraform apply  |   | Update .tf to match  |
    | reverts to match |   | reality, then        |
    | the .tf config   |   | terraform apply      |
    +------------------+   +----------------------+

What Drift Is Not
Drift is not the same as a bug in your Terraform code. Drift means real infrastructure diverged from what Terraform believes is deployed. Your code might be perfectly correct — the problem is that someone changed reality without going through Terraform.
**Remember:** By default, terraform plan refreshes its view of real infrastructure before comparing it to your config. Every time you run plan without -refresh=false, you are also running drift detection, whether you intended to or not.
Detailed Step-by-Step Practical Lab
Step 1 — Detect Drift with terraform plan
terraform plan is your default drift detector. It reads real infrastructure state from the cloud provider API before comparing anything:

    terraform plan

    # Output when drift is detected:
    # Terraform detected the following changes made outside of Terraform since the last
    # "terraform apply":
    #
    #   ~ resource "aws_security_group_rule" "app_ingress" {
    #       ~ from_port = 443 -> 8443   # someone changed this in the console
    #     }
    #
    # Unless you have made equivalent changes to your configuration, or ignored the
    # relevant attributes using ignore_changes, the following plan may include actions
    # to undo or respond to these changes.

The phrase "Terraform detected the following changes made outside of Terraform" is explicit. Terraform is telling you that this diff was not caused by your code: reality moved without you.
**Tip:** Read every plan output top to bottom before approving. The drift section appears ABOVE the planned changes section. Engineers who skim plans often miss it entirely.
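For scripted checks, the same plan can be classified by its exit status instead of by reading the output. The sketch below assumes a hypothetical helper named `check_drift` (the name and the three labels are illustrative, not part of Terraform) and relies only on the `-detailed-exitcode` flag. Exit code 2 means "the plan is non-empty", so it signals drift only when the configuration on the checked-out branch has already been fully applied.

```shell
#!/usr/bin/env bash
# Sketch: classify the result of `terraform plan -detailed-exitcode`.
# check_drift and its output labels are hypothetical names for illustration.
check_drift() {
  local dir="${1:-.}" code=0
  # -chdir must come before the subcommand; plan output is discarded here.
  terraform -chdir="$dir" plan -detailed-exitcode -input=false -no-color >/dev/null || code=$?
  case "$code" in
    0) echo "clean" ;;            # state, config, and real infra agree
    2) echo "changes" ;;          # non-empty plan: drift or unapplied config
    *) echo "error"; return 1 ;;  # plan itself failed
  esac
}
```

A call such as `check_drift environments/prod` prints one of the three labels, which a cron job or pre-merge hook can branch on.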
Step 2 — Update State Only, Without Changing Infrastructure
If the manual change was intentional and correct — like the Hotstar health check fix — use -refresh-only to acknowledge it in state without touching real infrastructure:

    terraform apply -refresh-only

    # Terraform shows the same drift diff, but asks a different question:
    # "Would you like to update the Terraform state to reflect these detected changes?"
    #
    # Answering yes:
    # - Updates ONLY the state file
    # - Makes only read API calls to AWS (nothing is created, modified, or deleted)
    # - Does NOT change any real resources
    # - Does NOT revert the manual change

    # Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
    # (State was updated to reflect the manual changes.)

**Warning:** Running -refresh-only without then updating your .tf file leaves your code out of sync with state. The next engineer who reads the .tf file will see the wrong config, and the next normal plan will propose reverting the change. Always follow a -refresh-only apply with a matching .tf update and PR.
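It can be safer to preview the state update before accepting it. The sketch below chains `terraform plan -refresh-only` (preview) with `terraform apply -refresh-only` (confirm and write state). The helper name `acknowledge_drift` is hypothetical, and the directory argument is an assumption about your layout.

```shell
#!/usr/bin/env bash
# Sketch: preview out-of-band changes, then accept them into state.
# acknowledge_drift is a hypothetical helper name used for illustration.
acknowledge_drift() {
  local dir="${1:-.}"
  # Preview only: shows what the state update would record, writes nothing.
  terraform -chdir="$dir" plan -refresh-only -input=false || return 1
  # Interactive: Terraform asks for confirmation before writing state.
  terraform -chdir="$dir" apply -refresh-only
}
```

The apply step is left interactive on purpose, so a human still confirms the state change. The matching .tf update remains a separate, manual follow-up.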
Step 3 — Reconcile Drift Back to Your Configuration
If the manual change was wrong, or a temporary emergency fix you do not want to keep, a normal terraform apply reverts reality back to match your code:

    # First, confirm what drift exists
    terraform plan
    # ~ aws_security_group_rule.app_ingress will be updated in-place
    #   ~ from_port = 8443 -> 443   <- reverting the manual change

    # Apply reverts the real security group back to port 443
    terraform apply
    # aws_security_group_rule.app_ingress: Modifying...
    # aws_security_group_rule.app_ingress: Modifications complete after 2s
    # Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Step 4 — Automate Scheduled Drift Detection in CI
Do not wait for an unrelated PR to stumble onto drift. Run a scheduled read-only plan against production every day:

    # .github/workflows/drift-detection.yml
    name: Scheduled Drift Detection

    on:
      schedule:
        - cron: "0 6 * * *"   # runs every day at 6am UTC
      workflow_dispatch:       # also allows manual trigger

    jobs:
      detect-drift:
        runs-on: ubuntu-latest
        environment: production   # OIDC subject is scoped to this environment
        permissions:
          id-token: write   # required for OIDC
          contents: read

        steps:
          - uses: actions/checkout@v4

          - uses: hashicorp/setup-terraform@v3
            with:
              terraform_version: "1.9.0"

          - name: Configure AWS credentials via OIDC
            uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: arn:aws:iam::123456789012:role/terraform-drift-readonly
              aws-region: ap-south-1

          - name: Terraform Init
            run: terraform init
            working-directory: environments/prod

          - name: Detect Drift
            id: plan
            run: |
              # -detailed-exitcode returns:
              #   exit 0 = no changes (no drift)
              #   exit 1 = error
              #   exit 2 = changes detected (drift found)
              # The setup-terraform wrapper publishes this code as steps.plan.outputs.exitcode.
              terraform plan -detailed-exitcode -out=drift.tfplan 2>&1 | tee plan_output.txt
            working-directory: environments/prod
            continue-on-error: true   # do not fail the job on exit code 2

          - name: Fail job on plan error
            if: steps.plan.outputs.exitcode == '1'
            run: exit 1   # a failed plan must not look like "no drift"

          - name: Alert on Drift
            if: steps.plan.outputs.exitcode == '2'
            run: |
              # Post a Slack alert when drift is found
              curl -s -X POST "${{ secrets.SLACK_WEBHOOK_URL }}" \
                -H 'Content-type: application/json' \
                --data "{ \"text\": \":rotating_light: *Drift detected in prod!* Terraform plan shows unexpected changes.\\nRun ID: ${{ github.run_id }}\\nCheck the workflow for details.\" }"

          - name: Fail job if drift found
            if: steps.plan.outputs.exitcode == '2'
            run: exit 1   # makes the scheduled job red in GitHub UI

**Tip:** Create a read-only IAM role specifically for the drift detection job. It needs read access to the managed resources and to the state backend (plus whatever your backend requires for state locking), never the write permissions that apply requires. This limits blast radius if the CI credentials are ever compromised.
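The workflow reads the plan result from a step output rather than from the shell, because a pipeline such as `terraform plan ... | tee file` reports the exit status of `tee`, not of `terraform`. Outside GitHub Actions that step output does not exist, so a script has to capture the status itself. The sketch below is one portable way to do that; `plan_with_log` and the default log filename are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: keep a log of the plan AND preserve terraform's exit code.
# plan_with_log is a hypothetical helper; the log path is an assumption.
plan_with_log() {
  local dir="${1:-.}" log="${2:-plan_output.txt}" code=0
  # Redirect instead of piping, so $? belongs to terraform itself.
  terraform -chdir="$dir" plan -detailed-exitcode -input=false -no-color >"$log" 2>&1 || code=$?
  cat "$log"        # replay the captured output for the job log
  return "$code"    # 0 = no changes, 1 = error, 2 = changes present
}
```

The trade-off is that output appears only after the plan finishes. In bash, `set -o pipefail` or `${PIPESTATUS[0]}` are alternatives that keep streaming through `tee`.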
Step 5 — Use Driftctl for Deeper Detection
terraform plan only detects drift on resources that Terraform already tracks in its state. Driftctl also finds resources that exist in your AWS account but were never managed by Terraform at all:

    # Install driftctl
    curl -sL https://raw.githubusercontent.com/snyk/driftctl/main/install.sh | bash

    # Authenticate to AWS (same credentials as Terraform)
    export AWS_PROFILE=mumbai-prod

    # Scan your AWS account against your Terraform state
    driftctl scan --from tfstate+s3://razorpay-terraform-state/prod/terraform.tfstate

    # Example output:
    # Scanned resources: (142)
    # Found 9 resource(s) not under Terraform management (unmanaged)
    # Found 3 resource(s) that changed outside of Terraform (drifted)
    #
    # Unmanaged resources:
    #   aws_security_group.legacy-bastion-sg
    #   aws_s3_bucket.old-deploy-bucket-2019
    #   aws_iam_user.rahul-personal-access
    #
    # Drifted resources:
    #   aws_instance.app_server[0]: instance_type changed (t3.medium -> t3.large)
    #   aws_security_group_rule.app_ingress: from_port changed (443 -> 8443)
    #   aws_db_instance.main: backup_retention_period changed (7 -> 14)

**Common Mistake:** Assuming terraform plan gives you the complete drift picture. It only compares resources that exist in your state file against reality. An EC2 instance someone clicked into existence and never imported into Terraform is completely invisible to terraform plan, but Driftctl surfaces it.
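A scan like this is easier to schedule when it leaves a machine-readable report and a simple verdict. The sketch below wraps `driftctl scan` with a JSON report written via `--output json://...`. The helper name `scan_for_drift`, the report filename, and the exit-code mapping (0 in sync, 1 not in sync, anything else an error) are assumptions based on driftctl's documented behaviour, so verify them against the version you install.

```shell
#!/usr/bin/env bash
# Sketch: run a driftctl scan, keep a JSON report, and print a verdict.
# scan_for_drift is a hypothetical helper; the exit-code mapping is an
# assumption to verify against your driftctl version.
scan_for_drift() {
  local state="$1" report="${2:-driftctl-report.json}" code=0
  driftctl scan --from "$state" --output "json://$report" || code=$?
  case "$code" in
    0) echo "in-sync" ;;
    1) echo "not-in-sync" ;;      # unmanaged or drifted resources found
    *) echo "error"; return 2 ;;  # the scan itself failed
  esac
}
```

For example, `scan_for_drift tfstate+s3://razorpay-terraform-state/prod/terraform.tfstate` reuses the state location from the scan above and leaves the report in the working directory.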
Step 6 — Prevent Drift with IAM Policies
The strongest long-term fix for drift is making direct console modifications impossible for Terraform-managed resources:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "DenyManualChangesToTerraformResources",
          "Effect": "Deny",
          "Action": [
            "ec2:ModifyInstanceAttribute",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:RevokeSecurityGroupIngress",
            "rds:ModifyDBInstance"
          ],
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "aws:ResourceTag/ManagedBy": "terraform"
            },
            "StringNotEquals": {
              "aws:PrincipalArn": "arn:aws:iam::123456789012:role/github-actions-terraform"
            }
          }
        }
      ]
    }

This policy denies the listed modification actions on any resource tagged ManagedBy: terraform to every principal except the GitHub Actions role. Console engineers, direct CLI access, and even other automation tools are blocked from those actions. Only the CI pipeline can perform them. Extend the Action list to cover each service you manage, and restrict who can remove the ManagedBy tag, because untagging a resource takes it out of the policy's scope.
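Because the deny only applies to resources that actually carry the tag, it can help to list what is currently tagged and compare that with what Terraform manages. The sketch below uses the Resource Groups Tagging API through the AWS CLI. The helper name `list_terraform_tagged` and the default region are assumptions, and the API only returns resource types that it supports.

```shell
#!/usr/bin/env bash
# Sketch: list ARNs of resources tagged ManagedBy=terraform in one region.
# list_terraform_tagged is a hypothetical helper; the region default is assumed.
list_terraform_tagged() {
  local region="${1:-ap-south-1}"
  aws resourcegroupstaggingapi get-resources \
    --region "$region" \
    --tag-filters Key=ManagedBy,Values=terraform \
    --query 'ResourceTagMappingList[].ResourceARN' \
    --output text | tr '\t' '\n'   # text output is tab-separated; one ARN per line
}
```

Any Terraform-managed resource missing from this list is not protected by the policy.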
Tag your Terraform resources consistently so this policy covers everything:

    # locals.tf: shared tagging for all resources
    locals {
      common_tags = {
        ManagedBy   = "terraform"             # required for IAM drift prevention policy
        Environment = var.environment         # prod, staging, dev
        Project     = "payment-api"           # matches Razorpay's service name
        Team        = "platform-engineering"  # owning team
      }
    }

    # Apply tags on every resource
    resource "aws_instance" "app_server" {
      ami           = data.aws_ami.ubuntu.id
      instance_type = "t3.medium"
      tags          = local.common_tags   # this tag triggers drift prevention policy
    }

Step 7 — The Drift Incident Response Procedure
Follow this every time drift is discovered — whether through a scheduled job alert or a confusing PR plan:

    STEP 1: Identify
      Run terraform plan -out=incident.tfplan and read the full output.
      Note which resources drifted and what changed.

    STEP 2: Trace
      Check AWS CloudTrail for who made the change and when.
      aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-0abc123 \
        --start-time 2024-01-01T00:00:00Z

    STEP 3: Decide
      Was the manual change intentional and correct?
      YES -> update the .tf file to match, open a PR, then apply
      NO  -> run terraform apply to revert, document in incident channel

    STEP 4: Document
      Post in your incident channel:
      - What drifted
      - Why the manual change was made
      - What action was taken (reverted or codified)
      - PR link if config was updated

    STEP 5: Prevent
      If this drift was caused by a gap in IAM policy, update the policy.
      If it was caused by a broken on-call runbook, update the runbook.

**Remember:** During an active incident, stop the incident first. Preserve Terraform consistency second. Make the manual fix, then follow this procedure once the incident is resolved. Never let Terraform purity get in the way of restoring service.
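The trace step is the one most often repeated under pressure, so it may be worth wrapping. The sketch below is a thin wrapper around the same `aws cloudtrail lookup-events` call, trimmed to the fields needed to answer "who, what, when". The helper name `trace_change` is hypothetical. Note that lookup-events only searches recent management events (the last 90 days) in the region the CLI is pointed at.

```shell
#!/usr/bin/env bash
# Sketch: show who touched a resource, and when, from CloudTrail event history.
# trace_change is a hypothetical helper name used for illustration.
trace_change() {
  local resource="$1" since="$2"
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=ResourceName,AttributeValue=$resource" \
    --start-time "$since" \
    --query 'Events[].[EventTime,Username,EventName]' \
    --output text
}
```

For example, `trace_change sg-0abc123 2024-01-01T00:00:00Z` prints one line per event with its time, the calling user, and the API action.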
Production Best Practices and Common Pitfalls
- Never let drift sit for more than one sprint. The longer a manual change goes unreconciled, the more likely someone will revert it accidentally, or build on top of it without knowing it has drifted.
- Treat -refresh-only as half the job. Updating state without updating the .tf file means config and state now disagree, so the next plan will propose reverting the change. Always follow up with a code change.
- Run Driftctl on a schedule. Unmanaged resources accumulate silently over months and years. One-time scans only show today's problem.
- Add lifecycle { ignore_changes = [...] } for auto-managed attributes. Some AWS resources self-modify legitimately, like Auto Scaling groups adjusting desired_capacity or AMIs being rotated by automation. These will appear as drift on every plan. Use ignore_changes to tell Terraform these attributes are intentionally managed outside Terraform.

      resource "aws_autoscaling_group" "app" {
        # ... other config ...
        lifecycle {
          # desired_capacity is managed by Auto Scaling policies, not Terraform
          ignore_changes = [desired_capacity]
        }
      }

- Combine prevention with detection. IAM policies prevent most console drift. Scheduled detection catches the rest, including emergency break-glass changes and changes made by other teams' automation.
Quick Reference and Troubleshooting Commands
| Command | What It Does |
|---|---|
| `terraform plan` | Detects drift by comparing real infra to state and config |
| `terraform plan -detailed-exitcode` | Returns exit code 2 when the plan has changes (drift included); useful in CI scripts |
| `terraform apply -refresh-only` | Updates state to match reality WITHOUT changing real infra |
| `terraform apply` | Reconciles drift by reverting real infra to match config |
| `terraform refresh` | Deprecated standalone refresh; use plan -refresh-only or apply -refresh-only instead |
| `driftctl scan` | Finds both drifted AND entirely unmanaged resources |
| `terraform plan -refresh=false` | Skips refresh entirely; faster but will not detect drift |

| Symptom | Root Cause | Fix |
|---|---|---|
| Unexplained diff appears in an unrelated PR plan | Someone made a manual change weeks ago, never reconciled | Trace via CloudTrail, decide reconcile vs update config, document it |
| terraform plan shows no drift but resource behaviour is wrong | The drifted resource was never in state — entirely unmanaged | Run driftctl scan to find unmanaged resources |
| Drift reappears on every plan despite running -refresh-only | The .tf file was never updated to match after the refresh | Update the resource block in code, not just the state |
| Scheduled drift job always fires, even with no real changes | An attribute that changes naturally (ASG desired_capacity, AMI ID) | Add lifecycle { ignore_changes = [...] } for that attribute |
| terraform force-unlock fails | Lock ID is wrong or lock is already released | Run terraform plan to confirm lock status, get correct ID from DynamoDB |