Terraform Drift Detection — Finding and Fixing Infrastructure That Changed | DevOps Network

Overview and What You Will Learn

Drift happens when an engineer logs into the AWS console at 2am during an incident and manually bumps a security group rule, or widens an RDS instance's storage, just to make the immediate problem go away. The fix works. The incident closes. But now your Terraform configuration still describes the OLD rule, and your Terraform state still believes the OLD rule is what is deployed. Reality moved. Terraform's understanding of reality did not. That gap is drift — and left unmanaged, it quietly grows until nobody trusts terraform plan output anymore because it is always full of unexplained changes nobody asked for.

By the end of this topic you will be able to:

Explain exactly how and why drift happens, with real incident-response scenarios
Use terraform plan as your first and fastest drift detection tool
Understand what -refresh-only does and how it differs from a full apply
Set up scheduled, automated drift detection in CI
Use Driftctl as a dedicated, more comprehensive drift detection tool
Decide, case by case, whether to reconcile drift back to your config or update your config to match reality
Prevent drift at the source with IAM policies that block console changes on Terraform-managed resources
Follow a clear incident response procedure the next time drift happens in production

Why This Matters in Production

At Hotstar, during a high-traffic cricket match, an on-call engineer noticed an ALB target group was returning unhealthy targets and manually changed the health check path directly in the AWS console to stop the bleeding — a completely reasonable, correct call under incident pressure. Three weeks later, a routine terraform plan on an unrelated PR showed an unexpected change to that same health check path, with no explanation in the diff. The reviewer almost approved it as probably fine before someone traced it back to the 2am incident fix that nobody had ever told Terraform about. Without that trace, the PR would have silently reverted a deliberate, still-needed fix back to the old broken health check — right before the next big match.

Drift is not just a technical problem. It is a trust problem. When your team stops believing terraform plan output, you have lost the most important safety net in your infrastructure workflow.

Core Principles

The Drift Lifecycle

Text

+------------------------------------------+
| Manual change made OUTSIDE Terraform     | <- console, CLI, or another tool
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Real infra now differs from state file   | <- Terraform does not know yet
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| terraform plan reads real infrastructure | <- drift is detected here
| and shows an unexpected diff             |
+------------------------------------------+
                    |
        +-----------+-----------+
        |                       |
        v                       v
+------------------+   +----------------------+
| RECONCILE        |   | UPDATE CONFIG        |
| terraform apply  |   | Update .tf to match  |
| reverts to match |   | reality, then        |
| the .tf config   |   | terraform apply      |
+------------------+   +----------------------+

What Drift Is Not

Drift is not the same as a bug in your Terraform code. Drift means real infrastructure diverged from what Terraform believes is deployed. Your code might be perfectly correct — the problem is that someone changed reality without going through Terraform.

REMEMBER THIS
**Remember:** By default, terraform plan refreshes its view of real infrastructure before comparing it to your config. Every plan you run is therefore also a drift check, whether you intended it or not, unless you pass -refresh=false.

Detailed Step-by-Step Practical Lab

Step 1 — Detect Drift with terraform plan

terraform plan is your default drift detector. It reads real infrastructure state from the cloud provider API before comparing anything:

Bash

terraform plan
 
# Output when drift is detected:
# Terraform detected the following changes made outside of Terraform since the last
# "terraform apply":
#
# ~ resource "aws_security_group_rule" "app_ingress" {
#     ~ from_port = 443 -> 8443   # someone changed this in the console
#   }
#
# Unless you have made equivalent changes to your configuration, or ignored the
# relevant attributes using ignore_changes, the following plan may include actions
# to undo or respond to these changes.

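The phrase "Terraform detected the following changes made outside of Terraform" is explicit. Terraform is telling you that this diff was not caused by your code: reality moved without you. The exact wording varies between Terraform versions, and from Terraform 1.2 onward a normal plan lists only the outside changes that may have affected the planned actions. Run terraform plan -refresh-only to see every change Terraform detected.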

PLACEMENT PRO TIP
**Tip:** Read every plan output top to bottom before approving. The drift section appears ABOVE the planned changes section. Engineers who skim plans often miss it entirely.
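
Because the drift section is easy to skim past, it can help to check for it mechanically. The sketch below assumes the plan output was saved to a file (the name plan.txt is hypothetical) and searches for the banner phrase quoted above. The function names are made up for illustration, and the banner wording differs between Terraform versions, so treat the pattern as an assumption to adjust.

```shell
#!/usr/bin/env bash
# Sketch: report whether saved plan output contains the "outside of Terraform" banner.
# Assumes the output was captured first, for example:
#   terraform plan -no-color > plan.txt

has_outside_changes() {
  local plan_file="$1"
  # Phrase taken from the plan output shown above; adjust it if your
  # Terraform version words the banner differently.
  grep -q "changes made outside of Terraform" "$plan_file"
}

report_outside_changes() {
  local plan_file="$1"
  if has_outside_changes "$plan_file"; then
    echo "DRIFT: read the section above the planned changes in $plan_file"
  else
    echo "OK: no outside changes reported in $plan_file"
  fi
}
```

Calling report_outside_changes plan.txt prints one line either way, which makes the result hard to miss in a long review log.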

Step 2 — Update State Only, Without Changing Infrastructure

If the manual change was intentional and correct — like the Hotstar health check fix — use -refresh-only to acknowledge it in state without touching real infrastructure:

Bash

terraform apply -refresh-only
 
# Terraform shows the same drift diff, but asks a different question:
# "Would you like to update the Terraform state to reflect these detected changes?"
#
# Answering yes:
# - Updates ONLY the state file
# - Makes read-only API calls to AWS to refresh state, and no write calls
# - Does NOT change any real resources
# - Does NOT revert the manual change
 
# Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
# (State was updated to reflect the manual changes.)

COMMON MISTAKE / WARNING
**Warning:** Running -refresh-only without then updating your .tf file leaves your code out of sync with state. The next engineer who reads the .tf file will see a configuration that no longer matches what is deployed, and the next normal apply will try to revert the manual change. Always follow a -refresh-only apply with a matching .tf update and PR.
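
One way to confirm that the follow-up .tf change really matches what -refresh-only accepted is to run a normal plan and look at its exit code. This is a minimal sketch, assuming it runs from an initialized Terraform directory. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: after `terraform apply -refresh-only` and the matching .tf edit,
# a normal plan should report no changes.

check_config_matches_state() {
  local code=0
  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
  terraform plan -detailed-exitcode -input=false -no-color > /dev/null || code=$?
  case "$code" in
    0) echo "in sync: config, state, and real infrastructure agree" ;;
    2) echo "not in sync: the .tf files still differ from the accepted state" ;;
    *) echo "plan failed with exit code $code" ;;
  esac
  return "$code"
}
```

If the function reports "not in sync", the PR that codifies the manual change is still missing or incomplete.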

Step 3 — Reconcile Drift Back to Your Configuration

If the manual change was wrong, or a temporary emergency fix you do not want to keep, a normal terraform apply reverts reality back to match your code:

Bash

# First, confirm what drift exists
terraform plan
# ~ aws_security_group_rule.app_ingress will be updated in-place
#   ~ from_port = 8443 -> 443   <- reverting the manual change
 
# Apply reverts the real security group back to port 443
terraform apply
# aws_security_group_rule.app_ingress: Modifying...
# aws_security_group_rule.app_ingress: Modifications complete after 2s
# Apply complete! Resources: 0 added, 1 changed, 0 destroyed.
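
Running terraform plan and then terraform apply builds the plan twice, so the second one can differ from the one you read. A minimal sketch of a more defensive revert, assuming an initialized working directory: save the plan, show it, and apply exactly that file. The function name and the default file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: revert drift by applying exactly the plan that was reviewed.

revert_drift() {
  local plan_file="${1:-revert.tfplan}"
  local answer=""

  # Save the plan so the reviewed changes are exactly the ones applied.
  terraform plan -input=false -out="$plan_file" || return 1

  # Print the saved plan in human-readable form for review.
  terraform show -no-color "$plan_file" || return 1

  read -r -p "Apply this exact plan? (yes/no) " answer || true
  if [ "$answer" = "yes" ]; then
    terraform apply -input=false "$plan_file"
  else
    echo "Aborted: nothing was changed."
    return 1
  fi
}
```

Anything other than a literal "yes" aborts without touching real infrastructure.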

Step 4 — Automate Scheduled Drift Detection in CI

Do not wait for an unrelated PR to stumble onto drift. Run a scheduled read-only plan against production every day:

YAML

# .github/workflows/drift-detection.yml
name: Scheduled Drift Detection
 
on:
  schedule:
    - cron: "0 6 * * *"   # runs every day at 6am UTC
  workflow_dispatch:       # also allows manual trigger
 
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    environment: production   # GitHub environment; the job assumes the read-only drift role below, not the apply role
 
    permissions:
      id-token: write   # required for OIDC
      contents: read
 
    steps:
      - uses: actions/checkout@v4
 
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.0"
 
      - name: Configure AWS credentials via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-drift-readonly
          aws-region: ap-south-1
 
      - name: Terraform Init
        run: terraform init
        working-directory: environments/prod
 
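      - name: Detect Drift
        id: plan
        run: |
          # -detailed-exitcode returns:
          # exit 0 = no changes (no drift)
          # exit 1 = error
          # exit 2 = changes detected (drift found)
          # The pipe to tee can hide terraform's exit code from the shell, so later
          # steps read steps.plan.outputs.exitcode, which the setup-terraform
          # wrapper (enabled by default) records for the terraform command.
          terraform plan -detailed-exitcode -out=drift.tfplan 2>&1 | tee plan_output.txt
        working-directory: environments/prod
        continue-on-error: true   # later steps decide pass or fail from the recorded exit code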
 
      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        run: |
          # Post a Slack alert when drift is found
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK_URL }}" \
            -H 'Content-type: application/json' \
            --data "{
              \"text\": \":rotating_light: *Drift detected in prod!* Terraform plan shows unexpected changes.\\nRun ID: ${{ github.run_id }}\\nCheck the workflow for details.\"
            }"
      - name: Fail job if drift found or plan errored
        if: steps.plan.outputs.exitcode != '0'
        run: exit 1   # makes the scheduled job red in GitHub UI

PLACEMENT PRO TIP
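**Tip:** Create a read-only IAM role specifically for the drift detection job. It needs only read access (Describe, List, and Get actions) to the managed services and to the state backend, never the write permissions that apply requires. If the backend uses state locking, the role also needs permission to write the lock, or the plan must run with -lock=false. This limits blast radius if the CI credentials are ever compromised.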
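
The Slack alert in the workflow says only that drift exists. A small helper can pull the summary line out of plan_output.txt so the message states how much changed. This is a sketch: the function name is hypothetical, and it assumes the plan output contains Terraform's usual summary line beginning with "Plan:".

```shell
#!/usr/bin/env bash
# Sketch: extract the one-line plan summary from captured plan output.

plan_summary() {
  local plan_file="$1"
  local esc
  esc=$(printf '\033')
  # Strip ANSI color codes, then keep the first line that starts with "Plan:".
  sed "s/${esc}\[[0-9;]*m//g" "$plan_file" \
    | grep -m1 '^Plan: ' \
    || echo "no summary line found in $plan_file"
}
```

In the alert step, something like SUMMARY=$(plan_summary environments/prod/plan_output.txt) could then be interpolated into the Slack text. The path is an assumption based on the working directory used in the workflow.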

Step 5 — Use Driftctl for Deeper Detection

terraform plan only detects drift on resources that Terraform already tracks in state. Driftctl also finds resources that were created in your AWS account and never touched by Terraform at all:

Bash

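# Install driftctl (Linux amd64 release binary; see the driftctl docs for other platforms)
curl -L https://github.com/snyk/driftctl/releases/latest/download/driftctl_linux_amd64 -o driftctl
chmod +x driftctl && sudo mv driftctl /usr/local/bin/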
 
# Authenticate to AWS (same credentials as Terraform)
export AWS_PROFILE=mumbai-prod
 
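# Scan your AWS account against your Terraform state
# (recent driftctl releases may report attribute-level changes only with the experimental --deep flag)
driftctl scan --from tfstate+s3://razorpay-terraform-state/prod/terraform.tfstate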
 
# Example output:
# Scanned resources:  (142)
# Found 9 resource(s) not under Terraform management (unmanaged)
# Found 3 resource(s) that changed outside of Terraform (drifted)
#
# Unmanaged resources:
# aws_security_group.legacy-bastion-sg
# aws_s3_bucket.old-deploy-bucket-2019
# aws_iam_user.rahul-personal-access
#
# Drifted resources:
# aws_instance.app_server[0]: instance_type changed (t3.medium -> t3.large)
# aws_security_group_rule.app_ingress: from_port changed (443 -> 8443)
# aws_db_instance.main: backup_retention_period changed (7 -> 14)

COMMON MISTAKE / WARNING
**Common Mistake:** Assuming terraform plan gives you the complete drift picture. It only compares resources that exist in your state file against reality. An EC2 instance someone clicked into existence and never imported into Terraform is completely invisible to terraform plan — but Driftctl surfaces it.
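
To run Driftctl from a script or scheduled job, branch on its exit status. The sketch below assumes the exit codes described in the driftctl documentation (0 when infrastructure is in sync, 1 when it is not, 2 on error), so verify them against the version you install. Note that 1 and 2 have the opposite meaning from terraform plan -detailed-exitcode. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run a driftctl scan and translate its exit status into a message.
# Assumed exit codes: 0 = in sync, 1 = not in sync, 2 = error.

run_driftctl_scan() {
  local state_location="$1"   # e.g. tfstate+s3://bucket/path/terraform.tfstate
  local code=0
  driftctl scan --from "$state_location" || code=$?
  case "$code" in
    0) echo "driftctl: infrastructure is in sync" ;;
    1) echo "driftctl: drifted or unmanaged resources found" ;;
    *) echo "driftctl: scan failed with exit code $code" ;;
  esac
  return "$code"
}
```

Returning the original code lets a CI step fail on drift and on errors while still printing a readable last line.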

Step 6 — Prevent Drift with IAM Policies

The strongest long-term fix for drift is making direct console modifications impossible for Terraform-managed resources:

JSON

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyManualChangesToTerraformResources",
      "Effect": "Deny",
      "Action": [
        "ec2:ModifyInstanceAttribute",
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupIngress",
        "rds:ModifyDBInstance"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/ManagedBy": "terraform"
        },
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::123456789012:role/github-actions-terraform"
        }
      }
    }
  ]
}

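This policy denies the four listed actions on any resource tagged ManagedBy: terraform to every principal except the GitHub Actions role. Console engineers, direct CLI access, and other automation tools are blocked from making those changes; only the CI pipeline can make them. Extend the Action list to cover every modifying API call you care about, including the tagging calls themselves, since anyone who can remove the tag can step around the Deny.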

Tag your Terraform resources consistently so this policy covers everything:

HCL

# locals.tf — shared tagging for all resources
locals {
  common_tags = {
    ManagedBy   = "terraform"            # required for IAM drift prevention policy
    Environment = var.environment        # prod, staging, dev
    Project     = "payment-api"          # matches Razorpay's service name
    Team        = "platform-engineering" # owning team
  }
}
 
# Apply tags on every resource
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.medium"
 
  tags = local.common_tags   # this tag triggers drift prevention policy
}
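
The Deny statement only protects resources that actually carry the tag, so it is worth checking for gaps. A minimal sketch, assuming you can produce one line per resource in the form ID, tab, ManagedBy value. The function name is hypothetical, and the AWS CLI query in the comment is one possible way to produce that input for EC2 instances, not a verified recipe.

```shell
#!/usr/bin/env bash
# Sketch: list resources the tag-based Deny policy would NOT protect.
# Reads "resource-id<TAB>ManagedBy-value" lines on stdin and prints the IDs
# whose value is anything other than "terraform" (including a missing tag).

find_unprotected() {
  awk -F'\t' '$2 != "terraform" { print $1 }'
}

# Possible input for EC2 instances (an assumption; adapt the query as needed):
#   aws ec2 describe-instances --output text \
#     --query "Reservations[].Instances[].[InstanceId, Tags[?Key=='ManagedBy'] | [0].Value]" \
#     | find_unprotected
```

Any ID this prints is either unmanaged or managed but missing the tag, and in both cases console edits to it are still allowed.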

Step 7 — The Drift Incident Response Procedure

Follow this every time drift is discovered — whether through a scheduled job alert or a confusing PR plan:

Text

STEP 1 — Identify
  Run terraform plan -out=incident.tfplan and read the full output.
  Note which resources drifted and what changed.
 
STEP 2 — Trace
  Check AWS CloudTrail for who made the change and when.
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=ResourceName,AttributeValue=sg-0abc123 \
    --start-time 2024-01-01T00:00:00Z
 
STEP 3 — Decide
  Was the manual change intentional and correct?
    YES -> update the .tf file to match, open a PR, then apply
    NO  -> run terraform apply to revert, document in incident channel
 
STEP 4 — Document
  Post in your incident channel:
  - What drifted
  - Why the manual change was made
  - What action was taken (reverted or codified)
  - PR link if config was updated
 
STEP 5 — Prevent
  If this drift was caused by a gap in IAM policy, update the policy.
  If it was caused by a broken on-call runbook, update the runbook.

REMEMBER THIS
**Remember:** During an active incident, stop the incident first. Preserve Terraform consistency second. Make the manual fix, then follow this procedure once the incident is resolved. Never let Terraform purity get in the way of restoring service.
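
For the Identify and Trace steps, it helps to turn the plan into a plain list of affected resource addresses. This sketch assumes the plan text was saved to a file and that resource header lines look like the ones Terraform prints, for example "# aws_security_group_rule.app_ingress will be updated in-place". The function name is hypothetical, and addresses that contain spaces are not handled.

```shell
#!/usr/bin/env bash
# Sketch: list the unique resource addresses mentioned in saved plan output.

drifted_addresses() {
  local plan_file="$1"
  local esc
  esc=$(printf '\033')
  # 1) strip ANSI color codes
  # 2) keep the address from header lines such as "# <address> has changed"
  #    or "# <address> will be updated in-place"
  # 3) de-duplicate
  sed "s/${esc}\[[0-9;]*m//g" "$plan_file" \
    | sed -n -E 's/^ *# ([^ ]+) (has changed|has been deleted|will be .*|must be replaced)$/\1/p' \
    | sort -u
}
```

Each address can then be passed to terraform state show to find the physical ID (such as sg-0abc123) that the CloudTrail lookup expects.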

Production Best Practices and Common Pitfalls

Never let drift sit for more than one sprint. The longer a manual change goes un-reconciled, the more likely someone will revert it accidentally — or build on top of it without knowing it is drifted.
Treat -refresh-only as half the job. Updating state without updating the .tf file means the next plan will propose reverting the manual change, because the configuration still describes the old value. Always follow up with a code change.
Run Driftctl on a schedule. Unmanaged resources accumulate silently over months and years. One-time scans only show today's problem.
Add lifecycle { ignore_changes = [...] } for auto-managed attributes. Some AWS resources self-modify legitimately — like Auto Scaling groups adjusting desired_capacity or AMIs being rotated by automation. These will appear as drift on every plan. Use ignore_changes to tell Terraform these attributes are intentionally managed outside Terraform.

HCL

resource "aws_autoscaling_group" "app" {
  # ... other config ...
 
  lifecycle {
    # desired_capacity is managed by Auto Scaling policies — not Terraform
    ignore_changes = [desired_capacity]
  }
}

Combine prevention with detection. IAM policies prevent most console drift. Scheduled detection catches the rest — including emergency break-glass changes and changes made by other teams' automation.

Quick Reference and Troubleshooting Commands

Command	What It Does
`terraform plan`	Detects drift by comparing real infra to state and config
`terraform plan -detailed-exitcode`	Returns exit code 2 when drift exists — useful in CI scripts
`terraform apply -refresh-only`	Updates state to match reality WITHOUT changing real infra
`terraform apply`	Reconciles drift by reverting real infra to match config
`terraform refresh`	Deprecated standalone refresh — use plan or apply -refresh-only instead
`driftctl scan`	Finds both drifted AND entirely unmanaged resources
`terraform plan -refresh=false`	Skips refresh entirely — faster but will not detect drift

Symptom	Root Cause	Fix
Unexplained diff appears in an unrelated PR plan	Someone made a manual change weeks ago, never reconciled	Trace via CloudTrail, decide reconcile vs update config, document it
terraform plan shows no drift but resource behaviour is wrong	The drifted resource was never in state — entirely unmanaged	Run driftctl scan to find unmanaged resources
Drift reappears on every plan despite running -refresh-only	The .tf file was never updated to match after the refresh	Update the resource block in code, not just the state
Scheduled drift job always fires, even with no real changes	An attribute that changes naturally (ASG desired_capacity, AMI ID)	Add lifecycle { ignore_changes = [...] } for that attribute
terraform force-unlock fails	Lock ID is wrong or lock is already released	Run terraform plan to confirm lock status, get correct ID from DynamoDB
