Terraform state is simple when you work alone and a nightmare when five teams share it. Here is the complete guide to remote backends, locking, and drift management at scale.
Terraform state works perfectly when one engineer runs it locally on one project. It starts causing problems the moment a second engineer runs terraform apply at the same time. By the time you have five teams, multiple environments, and hundreds of resources, state management has become the most dangerous part of your infrastructure workflow.
This article covers what actually goes wrong and how to design a state architecture that survives real multi-team usage.
Terraform uses the state file to map the resources it manages to the real infrastructure they correspond to. Without state, Terraform cannot know whether aws_instance.web in your config is an EC2 instance that already exists or one that needs to be created.
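The state file is also Terraform's record of your current infrastructure: it stores resource IDs, attribute values, and dependency relationships. In a very real sense it is more important than your Terraform code, because code can be restored from version control, while a lost or corrupted state file leaves Terraform unable to tell which resources it manages.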
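When state lives on a local filesystem, the problems are immediate: the file exists on one machine only, nothing stops two engineers from applying at the same time, and nobody notices when the real infrastructure diverges from what the file records.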
Remote backends solve the storage problem. State locking solves the concurrency problem. Drift detection solves the divergence problem. All three need to be in place before Terraform is used at team scale.
AWS S3 with DynamoDB locking is the most common remote backend setup. S3 stores the state file; DynamoDB handles the lock.
```hcl
terraform {
  backend "s3" {
    bucket         = "platform-terraform-state"
    key            = "prod/payment-service/terraform.tfstate"
    region         = "ap-south-1" # Mumbai region
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    kms_key_id     = "arn:aws:kms:ap-south-1:123456789:key/abc-123"
  }
}
```

```shell
# Create the DynamoDB lock table (one-time setup)
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-south-1
```

Every `terraform apply` now acquires a lock in DynamoDB before modifying state, and releases it when done. Concurrent applies will see the lock and fail immediately with a clear error, with no silent corruption.
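Newer Terraform releases also offer S3-native locking through a lock file stored next to the state object, which removes the DynamoDB dependency. The following is a minimal sketch, assuming Terraform 1.10 or later and reusing the bucket and key from the example above; check the S3 backend documentation for your version before relying on it.

```hcl
terraform {
  backend "s3" {
    bucket       = "platform-terraform-state"
    key          = "prod/payment-service/terraform.tfstate"
    region       = "ap-south-1"
    encrypt      = true
    use_lockfile = true # assumption: Terraform 1.10+; used here instead of dynamodb_table
  }
}
```

Teams on older Terraform versions should stay with the DynamoDB table.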
Enable S3 versioning on the state bucket. This is not optional:
```shell
aws s3api put-bucket-versioning \
  --bucket platform-terraform-state \
  --versioning-configuration Status=Enabled
```

With versioning enabled, every apply creates a new version of the state file. When an apply corrupts state (it happens), you can restore the previous version in about thirty seconds.
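The same one-time setup can be kept in Terraform itself, in a small bootstrap configuration that uses local state. This is a sketch rather than a hardened module: the resource labels (`state`, `state_lock`) are hypothetical, and it assumes version 4 or later of the AWS provider, where bucket versioning is a separate resource.

```hcl
# Bootstrap configuration for the state bucket and lock table.
provider "aws" {
  region = "ap-south-1"
}

resource "aws_s3_bucket" "state" {
  bucket = "platform-terraform-state"

  lifecycle {
    prevent_destroy = true # guard against deleting the bucket by accident
  }
}

# Keeps every previous version of each state file.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Same table as the CLI command: string hash key named LockID, on-demand billing.
resource "aws_dynamodb_table" "state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping the bootstrap in code means the bucket settings are reviewable, at the cost of one small configuration whose own state has to be stored somewhere safe.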
The biggest mistake multi-team orgs make with Terraform state is using a single state file. When the payment team and the data platform team share one state file, the payment team's apply can destroy a data platform resource if someone wrote a bad resource block.
The solution is state separation by blast radius. Each independent unit of infrastructure gets its own state file:
```
platform-terraform-state/
  prod/
    networking/terraform.tfstate       # VPCs, subnets, NAT
    payment-service/terraform.tfstate  # payment app infra
    order-service/terraform.tfstate    # order app infra
    data-platform/terraform.tfstate    # Kafka, S3, Redshift
    monitoring/terraform.tfstate       # Prometheus, Grafana
  staging/
    ...
  dev/
    ...
```

This means a broken payment-service apply cannot affect data-platform resources. The blast radius of any failure is bounded by the state file boundary.
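In practice, each root module declares its own backend block, and only the key changes between them. A sketch of what the order-service configuration might look like, assuming it reuses the bucket, lock table, and region from the payment-service example (the file name is hypothetical):

```hcl
# order-service/backend.tf (hypothetical file name)
terraform {
  backend "s3" {
    bucket         = "platform-terraform-state"
    key            = "prod/order-service/terraform.tfstate" # unique per state file
    region         = "ap-south-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

One lock table can serve every state file, because each lock entry is keyed by the bucket and state path.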
Use the `terraform_remote_state` data source to share outputs between state files without merging them:
```hcl
# In payment-service, read the private subnet ID from networking state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "platform-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "ap-south-1"
  }
}

resource "aws_instance" "payment_api" {
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_id
  # ...
}
```

The networking team manages VPCs. The payment team consumes the subnet ID as a read-only reference. Neither team can accidentally modify the other's resources.
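For this lookup to work, the networking configuration has to export the value as a root-level output, since `terraform_remote_state` only exposes root module outputs. A minimal sketch of the producer side, assuming the networking state defines a subnet resource named `aws_subnet.private` (a hypothetical name):

```hcl
# In the networking root module
output "private_subnet_id" {
  description = "Private subnet for application workloads"
  value       = aws_subnet.private.id # hypothetical resource address
}
```

The output name is the contract between the two teams: renaming it breaks every consumer on its next plan.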
When a lock is held and you try to apply, you see:
```
Error: Error acquiring the state lock

Lock Info:
  ID:        abc-123-def-456
  Path:      prod/payment-service/terraform.tfstate
  Operation: OperationTypeApply
  Who:       jenkins@ci.internal.yourplatform.net
  Version:   1.7.4
  Created:   2024-03-15T10:22:31.123456789Z
```

This tells you exactly who holds the lock, when they acquired it, and from which machine. If it's a CI job that crashed without releasing the lock, you force-unlock with the ID:
```shell
terraform force-unlock abc-123-def-456
```

Never force-unlock while a legitimate apply is running. Check the CI job status first. Forcing an unlock during an active apply is how you corrupt state.
Drift is when the real infrastructure diverges from what Terraform's state file says it should be. Someone ran aws ec2 modify-instance-attribute directly. An auto-scaling event changed something. A manual fix was applied during an incident.
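```shell
# Check for drift without making any changes
terraform plan -refresh-only

# If the drift is intentional, accept it into state
# (then update the config to match, or the next apply will revert it)
terraform apply -refresh-only

# If a resource was created outside Terraform, import it into state
terraform import aws_instance.payment_api i-0abc123def456
```

For automated drift detection, run `terraform plan -refresh-only` in CI on a schedule: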
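On Terraform 1.5 and later, the same import can be declared in configuration instead of run as a one-off command, so that it goes through plan and review like any other change. A sketch using the address and instance ID from the command above; it assumes a matching `resource "aws_instance" "payment_api"` block already exists in the configuration.

```hcl
# Declarative import: Terraform adopts the instance on the next apply.
import {
  to = aws_instance.payment_api
  id = "i-0abc123def456"
}
```

Once the apply has succeeded, the `import` block can be deleted from the configuration.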
```yaml
# GitHub Actions drift detection workflow
name: Drift Detection
on:
  schedule:
    - cron: '30 2 * * 1-5' # 8 AM IST on weekdays; GitHub cron runs in UTC (2:30 AM)

permissions:
  id-token: write # assumes the role is assumed via OIDC
  contents: read

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/TerraformReadOnly
          aws-region: ap-south-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # keep Terraform's raw exit codes
      - name: Check for Drift
        run: |
          terraform init
          terraform plan -refresh-only -detailed-exitcode
          # exit code 2 means drift detected
```

When this job exits with code 2, something changed outside of Terraform. The failed run notifies your team immediately, so they do not discover the drift during the next apply.
If managing S3 + DynamoDB backends across twenty modules is too much overhead, Terraform Cloud (now HCP Terraform) handles backends, locking, and run history as a managed service.
The backend config simplifies to:
```hcl
terraform {
  cloud {
    organization = "your-org"
    workspaces {
      name = "payment-service-prod"
    }
  }
}
```

Runs execute remotely, state is managed, and you get a full audit log of who applied what. At the time of writing, the free tier covers up to 500 managed resources, which is enough for small teams to use it without cost.
Never run terraform apply directly from a developer's laptop in production. All production applies should go through a CI pipeline with approval gates. This creates an audit trail and prevents "works on my machine" state corruption.
Use separate AWS IAM roles for Terraform CI vs Terraform read-only. Drift detection doesn't need write permissions — give it a role that can only read state and describe resources. Reserve write permissions for the apply role, and require MFA or OIDC for that role.
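As a rough sketch, the state-related permissions for the drift-detection role might look like the following. The data source label `drift_state_access` is hypothetical, the ARNs reuse the bucket, lock table, and KMS key from the backend example, and the lock-table write actions are included on the assumption that `terraform plan` still takes a state lock. Read access to the resources themselves (for example via the AWS managed `ReadOnlyAccess` policy) would be attached separately.

```hcl
data "aws_iam_policy_document" "drift_state_access" {
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::platform-terraform-state"]
  }

  statement {
    sid       = "ReadStateObjects"
    actions   = ["s3:GetObject"] # no s3:PutObject: this role never writes state
    resources = ["arn:aws:s3:::platform-terraform-state/*"]
  }

  statement {
    sid       = "DecryptState"
    actions   = ["kms:Decrypt"]
    resources = ["arn:aws:kms:ap-south-1:123456789:key/abc-123"]
  }

  statement {
    sid       = "AcquireAndReleaseLock"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:ap-south-1:123456789:table/terraform-state-lock"]
  }
}
```

The apply role would add `s3:PutObject` and the write permissions for the resources it manages; keeping the two documents separate makes the difference easy to audit.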
Tag every Terraform-managed resource with a `ManagedBy = "terraform"` tag and the workspace or state path. When you find a resource in the console that looks unfamiliar, the tag tells you which state file owns it.
```hcl
# Apply to every module via providers.tf
provider "aws" {
  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      StateFile   = "prod/payment-service"
      Team        = "payments-squad"
    }
  }
}
```

| Backend | Locking | Cost | Complexity |
|---|---|---|---|
| Local | None | Free | None (dangerous) |
| S3 + DynamoDB | Yes | Very low | Low |
| HCP Terraform | Yes | Free tier | Very low |
| Terraform Enterprise | Yes | High | Medium |
S3 + DynamoDB is the right default for AWS-native teams. HCP Terraform is better when you want managed runs, policy enforcement, and don't want to maintain the backend infrastructure.
After a partial apply, run `terraform state list` to see which resources made it into state, then use `terraform import` to bring in any cloud resources that were created successfully but not recorded. For resources that failed mid-creation, manually delete them from the cloud provider console first, then re-run `terraform apply`. Never run `terraform destroy` after a partial apply without first reconciling state, as it may target unintended resources.
When refactoring modules, use `terraform state mv` to rename or reorganize resource addresses within state before updating module references in HCL. This preserves resource IDs and prevents Terraform from destroying and recreating resources because their addresses changed. Run `terraform plan` after every `state mv` and verify that it shows zero resource changes before committing the refactored module code.
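Terraform 1.1 and later also supports `moved` blocks, which record the same address change in configuration so that every workspace picks it up on its next plan. A minimal sketch; the module name `payment` and the resource name `api` are hypothetical:

```hcl
# Tells Terraform the resource was renamed, not replaced.
moved {
  from = aws_instance.payment_api
  to   = module.payment.aws_instance.api
}
```

Unlike `terraform state mv`, this change is reviewed in a pull request and shows up in the plan as a move rather than a destroy and create.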