Terraform Data Sources — Reading Existing Infrastructure Without Managing It | DevOps Network

Overview and What You Will Learn

A Terraform resource block creates something and takes ownership of it — Terraform will create it, update it, and destroy it. But sometimes you need to reference something that already exists and belongs to someone else. The VPC your networking team created. The SSL certificate your security team manages. The latest Amazon Linux AMI that changes every few weeks. You need the value, but you do not want Terraform to manage — or destroy — the thing.

That is exactly what a data source is for. A data source reads existing information from a cloud provider and makes it available in your configuration, without creating anything or taking ownership of anything.

In this lab you will learn:

The difference between a resource block and a data block — when to use each
How to fetch the latest Amazon Linux 2 AMI without hardcoding an AMI ID
How to read an existing VPC, subnet, and security group created by another team
How to pull secrets from AWS Secrets Manager without hardcoding them, and why they still end up in state
How to use aws_caller_identity and aws_region to make configs self-aware
How data source results flow into resource arguments

Why This Matters in Production

At Hotstar, the networking team owns the VPC and subnets. The platform team owns the EC2 and RDS resources. The security team owns the SSL certificates. No team creates what another team owns — but every team needs to reference what the others create.

Without data sources, the platform team would have to hardcode VPC IDs (vpc-0a1b2c3d4e) and subnet IDs (subnet-0a1b2c3d4e) into their Terraform configuration. When the networking team recreates the VPC for a new region, every downstream configuration breaks and needs manual updates.

With data sources, the platform team looks up the VPC by name tag at plan time. The VPC ID is fetched dynamically. The configuration is portable across regions and environments — no hardcoded IDs to maintain.

Core Principles

◈ DIAGRAM

+------------------------------------------+
| data block (reads, never creates)        |
| data "aws_vpc" "main" {                  |
|   tags = { Name = "hotstar-prod-vpc" }   |
| }                                        |
+------------------------------------------+
                    |
                    | (read-only API call at plan time)
                    v
+------------------------------------------+
| Cloud Provider API                       |
| Fetches: VPC id, cidr_block, owner_id    |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| Data source result available as:         |
| data.aws_vpc.main.id                     |
| data.aws_vpc.main.cidr_block             |
+------------------------------------------+
                    |
                    v
+------------------------------------------+
| resource block (uses the read value)     |
| resource "aws_subnet" "app" {            |
|   vpc_id = data.aws_vpc.main.id          |
| }                                        |
+------------------------------------------+

Key difference: data vs resource

◈ DIAGRAM

+------------------------+          +------------------------------+
|  resource block        |          |  data block                  |
|                        |          |                              |
|  Creates new object    |          |  Reads existing object       |
|  Stored in state       |          |  Result cached in state      |
|  Can be destroyed      |          |  Cannot be destroyed         |
|  Terraform owns it     |          |  Terraform reads it only     |
+------------------------+          +------------------------------+

Detailed Step-by-Step Practical Lab

Part 1 — Data Block Syntax

The syntax of a data block mirrors a resource block — data keyword, type, and local name — but it reads instead of creates:

# resource block: creates and manages (hardcoded AMI ID shown for contrast)
resource "aws_instance" "web_hardcoded" {
  ami           = "ami-0f5ee92e2d63afc18"
  instance_type = "t3.micro"
}
 
# data block — reads existing, creates nothing
data "aws_ami" "amazon_linux" {
  most_recent = true           # if multiple match, return the newest one
  owners      = ["amazon"]     # only AMIs published by Amazon, not community
 
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]   # Amazon Linux 2 naming pattern
  }
 
  filter {
    name   = "virtualization-type"
    values = ["hvm"]   # hardware virtual machine — required for t3/t4 instances
  }
}
 
# Reference the data source result with data.<type>.<name>.<attribute>
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id   # e.g., ami-0f5ee92e2d63afc18
  instance_type = "t3.micro"
}

PLACEMENT PRO TIP
**Tip:** Notice the reference format: `data.aws_ami.amazon_linux.id` — it always starts with `data.` to distinguish data source references from resource references (`aws_instance.web.id`).

Part 2 — AMI Data Source (Most Common Use Case)

Hardcoding an AMI ID (ami-0f5ee92e2d63afc18) is a maintenance trap. Amazon releases patched AMIs regularly, so a hardcoded ID eventually points to an outdated image. AMI IDs are also region-specific, so the same ID does not exist when you deploy to a new region.

The aws_ami data source resolves the newest AMI matching your filters each time Terraform plans:

# data.tf — all data sources in one file for clarity
 
# Amazon Linux 2 — for EC2 instances running yum-based workloads
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]
 
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
 
  filter {
    name   = "state"
    values = ["available"]   # only fetch AMIs that are ready to use
  }
}
 
# Amazon Linux 2023: successor to Amazon Linux 2, uses dnf instead of yum
data "aws_ami" "amazon_linux_2023" {
  most_recent = true
  owners      = ["amazon"]
 
  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-kernel-*-x86_64"]   # the "2023." prefix keeps al2023-ami-minimal-* images out of the match
  }
}
 
# Ubuntu 22.04 LTS — for apt-based workloads
data "aws_ami" "ubuntu_22_04" {
  most_recent = true
  owners      = ["099720109477"]   # Canonical's AWS account ID — always verify this
 
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
 
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# Using the AMI data sources in resources
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux_2.id    # newest matching AMI at plan time; a newer release later forces replacement
  instance_type = "t3.medium"
 
  tags = {
    Name    = "razorpay-api-app"
    AMIUsed = data.aws_ami.amazon_linux_2.name   # log which AMI was used
  }
}
 
# Output the AMI ID so you can see what was used
output "ami_id_used" {
  description = "AMI ID selected by the data source — for auditability"
  value       = data.aws_ami.amazon_linux_2.id
}
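
Because most_recent = true is re-resolved on every plan, a newly published AMI changes the ami argument and Terraform proposes replacing the instance. The following is a minimal sketch of one way to opt out of that for long-lived instances, assuming AMI rollouts are handled by a separate, deliberate process. The resource name app_server_pinned is hypothetical.

```hcl
# Sketch only: keep the AMI chosen at creation time until the instance
# is replaced on purpose.
resource "aws_instance" "app_server_pinned" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"

  lifecycle {
    # Ignore later changes to the resolved AMI ID so that a new AMI release
    # does not force replacement during an unrelated apply.
    ignore_changes = [ami]
  }
}
```

To pick up a newer AMI later, the instance can be replaced explicitly, for example with terraform apply -replace=aws_instance.app_server_pinned.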

Part 3 — VPC and Subnet Data Sources (Cross-Team References)

In production, the networking team creates the VPC and subnets. The application team creates EC2 and RDS. Data sources let the application team reference the VPC without owning it.

# Read a VPC by its Name tag — no hardcoded VPC ID
data "aws_vpc" "main" {
  tags = {
    Name        = "hotstar-prod-vpc"   # must match exactly
    Environment = "prod"
  }
}
 
# Read all private subnets in that VPC
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]   # use the VPC we just read
  }
 
  tags = {
    Tier = "private"   # networking team tags their subnets — read the tag
  }
}
 
# Read a specific subnet by tag
data "aws_subnet" "app_primary" {
  vpc_id = data.aws_vpc.main.id
  tags = {
    Name = "hotstar-prod-private-ap-south-1a"
  }
}

# Use the VPC data source result in RDS subnet group
resource "aws_db_subnet_group" "app" {
  name       = "razorpay-payments-db-subnet-group"
  subnet_ids = data.aws_subnets.private.ids   # list of all private subnet IDs
 
  tags = { Name = "razorpay-payments" }
}
 
# Use the VPC in a security group
resource "aws_security_group" "rds" {
  name   = "razorpay-rds-sg"
  vpc_id = data.aws_vpc.main.id   # must be in same VPC as the RDS instance
}
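
The aws_subnets data source returns only subnet IDs. When you also need per-subnet attributes such as the CIDR block or availability zone, one common approach is to read each subnet individually with for_each. This is a sketch under that assumption; the names private_each, private_subnet_cidrs, and private_subnet_azs are hypothetical.

```hcl
# Sketch only: read every private subnet found above, one data source instance per ID.
data "aws_subnet" "private_each" {
  for_each = toset(data.aws_subnets.private.ids)
  id       = each.value
}

output "private_subnet_cidrs" {
  description = "CIDR block of every private subnet, keyed by subnet ID"
  value       = { for id, s in data.aws_subnet.private_each : id => s.cidr_block }
}

output "private_subnet_azs" {
  description = "Availability zone of every private subnet, keyed by subnet ID"
  value       = { for id, s in data.aws_subnet.private_each : id => s.availability_zone }
}
```

This works because the list of IDs is known at plan time; for_each over values that are only known after apply would be rejected.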

Part 4 — Account and Region Data Sources (Self-Aware Configs)

These three data sources make your configuration environment-aware without hardcoding account IDs, region strings, or availability zone names:

# aws_caller_identity — reads the current AWS account details
data "aws_caller_identity" "current" {}
# No filter arguments needed — reads the caller (the IAM user or role running Terraform)
 
# aws_region — reads the currently configured AWS region
data "aws_region" "current" {}
 
# aws_availability_zones — lists all AZs in the current region
data "aws_availability_zones" "available" {
  state = "available"   # only fetch AZs that are currently operational
}

# Use account ID for globally unique S3 bucket names
resource "aws_s3_bucket" "terraform_state" {
  # S3 bucket names are globally unique — include account ID to guarantee uniqueness
  bucket = "razorpay-terraform-state-${data.aws_caller_identity.current.account_id}"
}
 
# Use region in a log group name
resource "aws_cloudwatch_log_group" "app" {
  name = "/aws/ec2/${data.aws_region.current.name}/app"
  # Result: /aws/ec2/ap-south-1/app
  # Note: AWS provider v6 deprecates `name` on aws_region in favour of `region`
}
 
# Use available AZs for subnet distribution
resource "aws_subnet" "private" {
  # Create one subnet per AZ — works in any region without hardcoding AZ names
  count             = length(data.aws_availability_zones.available.names)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
}
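
The same data sources can also be combined to build a full ARN without hardcoding the account, region, or partition. This is a sketch only: the aws_partition data source and the local name app_log_group_arn are additions not present in the lab, and on recent AWS provider versions the region attribute may be preferred over name.

```hcl
# Sketch only: build a CloudWatch Logs ARN from the caller's partition, region, and account.
data "aws_partition" "current" {}

locals {
  # Format: arn:<partition>:logs:<region>:<account-id>:log-group:<log-group-name>
  app_log_group_arn = join(":", [
    "arn",
    data.aws_partition.current.partition, # "aws", "aws-cn", or "aws-us-gov"
    "logs",
    data.aws_region.current.name,
    data.aws_caller_identity.current.account_id,
    "log-group",
    "/aws/ec2/${data.aws_region.current.name}/app",
  ])
}

output "app_log_group_arn" {
  description = "ARN assembled from data source values, with no hardcoded account or region"
  value       = local.app_log_group_arn
}
```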

Part 5 — Security Group and IAM Data Sources

# Read an existing security group by name — often owned by another team
data "aws_security_group" "bastion" {
  name   = "hotstar-bastion-sg"
  vpc_id = data.aws_vpc.main.id
}
 
# Read an existing IAM role — to attach a policy without owning the role
data "aws_iam_role" "ec2_instance_role" {
  name = "hotstar-ec2-instance-role"
}
 
# Read an existing IAM policy by ARN
data "aws_iam_policy" "ssm_managed" {
  arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Attach the existing IAM policy to the existing role
resource "aws_iam_role_policy_attachment" "ssm" {
  role       = data.aws_iam_role.ec2_instance_role.name
  policy_arn = data.aws_iam_policy.ssm_managed.arn
}
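
The bastion security group read above can be used the same way: as the source of an ingress rule on a group you do own. The sketch below assumes a hypothetical application security group named hotstar-app-sg and uses the standalone aws_vpc_security_group_ingress_rule resource; it should not be mixed with inline ingress blocks on the same group.

```hcl
# Sketch only: a hypothetical application security group owned by this configuration.
resource "aws_security_group" "app" {
  name   = "hotstar-app-sg"
  vpc_id = data.aws_vpc.main.id
}

# Allow SSH into the app group from members of the bastion group.
# The bastion group itself is only read, never modified.
resource "aws_vpc_security_group_ingress_rule" "app_ssh_from_bastion" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = data.aws_security_group.bastion.id
  ip_protocol                  = "tcp"
  from_port                    = 22
  to_port                      = 22
}
```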

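Part 6 — Secrets Manager Data Source (Secrets Out of Code, Not Out of State)

Reading a secret from AWS Secrets Manager via a data source keeps the value out of your .tf files, variable files, and version control. Terraform marks secret_string as sensitive, so it is redacted in plan output. It does not keep the value out of state: the data source result, and any resource argument that uses it, are written to the state file.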

# Read the secret metadata (name, ARN, description)
data "aws_secretsmanager_secret" "db_password" {
  name = "razorpay/payments-db/master-password"
}
 
# Read the current secret value
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = data.aws_secretsmanager_secret.db_password.id
}

# Use the secret value in an RDS resource
resource "aws_db_instance" "payments" {
  identifier = "razorpay-payments-prod"
  engine     = "postgres"
  username   = "payments_admin"
 
  # Read from Secrets Manager — not stored as a Terraform variable
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
 
  instance_class    = "db.r6g.large"
  allocated_storage = 500
}

COMMON MISTAKE / WARNING
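**Security:** Reading from Secrets Manager via a data source does not keep the secret out of state. After apply, the value is stored in the state file twice: in the `aws_secretsmanager_secret_version` data source result and in the `aws_db_instance` resource attributes. The state file must be encrypted (for example, S3 server-side encryption) and access-controlled. To keep the password out of state altogether, let the service manage it (RDS supports `manage_master_user_password`), or use the Vault provider with short-lived dynamic credentials to limit how long a leaked value stays useful.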
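
One way to avoid handing the password to Terraform at all is to let RDS generate and store it in Secrets Manager. The following is a sketch of that alternative, meant to be used in place of the password argument rather than alongside it. The names payments_managed and payments_master_secret_arn are hypothetical.

```hcl
# Sketch only: RDS creates the master password in Secrets Manager.
# Terraform never receives the password value, so it is not written to state.
resource "aws_db_instance" "payments_managed" {
  identifier        = "razorpay-payments-prod"
  engine            = "postgres"
  username          = "payments_admin"
  instance_class    = "db.r6g.large"
  allocated_storage = 500

  manage_master_user_password = true # cannot be combined with the password argument
}

# Only the ARN of the generated secret is exposed to Terraform.
output "payments_master_secret_arn" {
  description = "ARN of the RDS-managed master user secret"
  value       = aws_db_instance.payments_managed.master_user_secret[0].secret_arn
}
```

Applications then fetch the password from Secrets Manager at runtime using the ARN, with their own IAM permissions.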

Part 7 — SSM Parameter Store Data Source

AWS Systems Manager Parameter Store is a good place for configuration values that are not secret but change per environment (database hostnames, service endpoints, feature flags):

# Read a plain string parameter
data "aws_ssm_parameter" "db_endpoint" {
  name = "/razorpay/${var.environment}/database/endpoint"
  # Result: "payments-prod.abc123.ap-south-1.rds.amazonaws.com"
}
 
# Read a SecureString parameter (encrypted with KMS)
data "aws_ssm_parameter" "api_key" {
  name            = "/razorpay/${var.environment}/third-party/api-key"
  with_decryption = true   # decrypt SecureString parameters (true is also the default)
}

# Use the parameter value in an ECS task definition environment variable
resource "aws_ecs_task_definition" "api" {
  family = "razorpay-api"
  container_definitions = jsonencode([{
    name  = "api"
    image = "razorpay/api:latest"
    environment = [
      {
        name  = "DB_ENDPOINT"
        value = data.aws_ssm_parameter.db_endpoint.value
      }
    ]
  }])
}
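
The SecureString parameter read above is better passed to the container by reference than by value, so the decrypted key never appears in the task definition. The sketch below uses the ECS container-level secrets field with the parameter's ARN. The resource name api_with_secret, the memory value, and the variable ecs_execution_role_arn are assumptions for illustration.

```hcl
# Sketch only: hypothetical input for an existing task execution role that is
# allowed to call ssm:GetParameters on the parameter.
variable "ecs_execution_role_arn" {
  type = string
}

resource "aws_ecs_task_definition" "api_with_secret" {
  family             = "razorpay-api"
  execution_role_arn = var.ecs_execution_role_arn

  container_definitions = jsonencode([{
    name   = "api"
    image  = "razorpay/api:latest"
    memory = 512 # assumed value; size this for your workload
    environment = [
      { name = "DB_ENDPOINT", value = data.aws_ssm_parameter.db_endpoint.value }
    ]
    # ECS resolves the parameter when the container starts;
    # only the ARN is stored in the task definition.
    secrets = [
      { name = "API_KEY", valueFrom = data.aws_ssm_parameter.api_key.arn }
    ]
  }])
}
```

Note that the data block still reads the decrypted value into state. If only the ARN is needed, setting with_decryption = false on the data source is worth considering.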

Part 8 — When Data Sources Fetch Their Data

Understanding when data sources run matters for plan-time vs apply-time behaviour:

# Data sources that run at PLAN time (most common):
# - aws_ami, aws_vpc, aws_subnet, aws_caller_identity, aws_region
# - These fetch data before any apply happens
# - Results are known in the plan output
 
# Data sources that may run at APPLY time:
# - Data sources that depend on resources being created first
# Example: reading a security group that Terraform creates in the same apply
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = aws_vpc.main.id
}
 
# This data source can only read the security group after it is created
data "aws_security_group" "app_lookup" {
  name   = aws_security_group.app.name   # depends on the resource above
  vpc_id = aws_vpc.main.id
}
# In the plan output, data.aws_security_group.app_lookup.id will show as
# "(known after apply)" because the security group does not exist yet

REMEMBER THIS
**Remember:** If a data source result shows `(known after apply)` in the plan, it means the data source depends on a resource that has not been created yet. This is normal — the value will be known after the apply finishes.
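
When the object is created in the same configuration, the lookup usually adds nothing, and a direct reference keeps the value visible to the dependency graph. Where a lookup is genuinely required, an explicit depends_on defers the read until the dependency has been applied. Both are sketched below, reusing aws_security_group.app and aws_vpc.main from the example above; the names app_security_group_id and app_lookup_deferred are hypothetical.

```hcl
# Sketch only: preferred form. No data block is needed because the
# resource already exposes its own attributes.
output "app_security_group_id" {
  value = aws_security_group.app.id
}

# Sketch only: if the lookup cannot be avoided (for example, the name is a
# fixed string rather than a reference), declare the dependency explicitly
# so Terraform reads the data source after the security group exists.
data "aws_security_group" "app_lookup_deferred" {
  name       = "app-sg"
  vpc_id     = aws_vpc.main.id
  depends_on = [aws_security_group.app]
}
```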

Part 9 — Complete Example: Multi-Team Infrastructure Reference

This is the pattern used when multiple Terraform configurations manage different layers of infrastructure:

# data.tf — reading everything the networking team created
 
# The VPC — networking team owns this, we just read it
data "aws_vpc" "main" {
  tags = { Name = "${var.environment}-vpc", Team = "networking" }
}
 
# Private subnets for our EC2 and RDS instances
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
  tags = { Tier = "private" }
}
 
# The bastion security group — we need to allow SSH from it
data "aws_security_group" "bastion" {
  name   = "${var.environment}-bastion-sg"
  vpc_id = data.aws_vpc.main.id
}
 
# Current account and region
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
 
# Latest Amazon Linux 2 AMI for this region
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}
 
# RDS password from Secrets Manager — not a Terraform variable
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "razorpay/${var.environment}/rds/master-password"
}

# main.tf — using everything we read above
 
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = local.ec2_instance_type
  subnet_id     = data.aws_subnets.private.ids[0]
 
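  # Attach only our own security group; bastion access is granted by the ingress rule below.
  # (Attaching the bastion group to this instance would not allow SSH from the bastion.)
  vpc_security_group_ids = [aws_security_group.app.id]
 
  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-app"
    AMI  = data.aws_ami.amazon_linux.id
  })
}
 
# Allow SSH into our security group from the bastion group owned by another team
resource "aws_vpc_security_group_ingress_rule" "ssh_from_bastion" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = data.aws_security_group.bastion.id
  ip_protocol                  = "tcp"
  from_port                    = 22
  to_port                      = 22
}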
 
resource "aws_db_instance" "main" {
  identifier     = "${local.name_prefix}-postgres"
  engine         = "postgres"
  instance_class = var.database_config.instance_class
  db_subnet_group_name = aws_db_subnet_group.main.name
  password       = data.aws_secretsmanager_secret_version.db_password.secret_string
 
  tags = local.common_tags
}
 
resource "aws_db_subnet_group" "main" {
  name       = "${local.name_prefix}-db-subnet-group"
  subnet_ids = data.aws_subnets.private.ids   # all private subnets from data source
}

Production Best Practices and Common Pitfalls

Use data sources instead of hardcoding IDs. Never put vpc-0a1b2c3d4e5f or ami-0f5ee92e2d63afc18 directly in resource arguments. These IDs change per region and per account. Data sources make your config portable.
Tag your resources consistently so data sources can find them. A data source filtering by tags = { Name = "prod-vpc" } only works if the VPC was actually tagged that way. Establish a tagging standard before you start writing data sources that rely on tags.
Add most_recent = true to AMI data sources. Without it, if multiple AMIs match your filter, Terraform errors. With most_recent = true, Terraform picks the newest match, which gives you current security patches but also means a new AMI release will change the ID and propose replacing instances that use it.
Do not use a data source to look up something the same configuration creates. If you create a VPC and look it up with a data "aws_vpc" block filtered by a hardcoded tag, Terraform sees no dependency between the two, reads the data source at plan time before the VPC exists, and the plan fails. Reference the resource directly: aws_vpc.main.id, not data.aws_vpc.main.id.
Pin the Canonical Ubuntu AMI owner ID. The value 099720109477 is Canonical's AWS account ID for Ubuntu AMIs. Verify this at https://cloud-images.ubuntu.com/locator/ec2/ — it does not change, but verifying protects you from malicious community AMIs with similar names.
Data source results are recorded in state but re-read on every plan and apply. If the underlying object changes (someone renames the VPC's Name tag), your next plan may fail because the lookup no longer matches anything. That failure is useful: it forces you to update your configuration when a dependency changes. It also means any sensitive value a data source returns ends up in the state file.
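
A lookup that silently returns the wrong object is harder to debug than one that fails. On Terraform 1.2 or later, a postcondition on the data block can turn an unexpected result into a clear plan-time error. This is a sketch only; the name amazon_linux_checked and the specific checks are illustrative assumptions.

```hcl
# Sketch only: fail the plan with a readable message if the selected AMI
# is not what the rest of the configuration expects.
data "aws_ami" "amazon_linux_checked" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  lifecycle {
    postcondition {
      condition     = self.architecture == "x86_64" && self.root_device_type == "ebs"
      error_message = "The selected AMI must be an x86_64, EBS-backed image."
    }
  }
}
```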

Quick Reference and Troubleshooting Commands

Command	What It Does
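`terraform plan`	Reads data sources during planning; a read deferred to apply is marked with the `<=` symbol
`terraform console`	Test data source expressions interactively
`terraform state list`	Lists data sources alongside resources, prefixed with `data.` (for example `data.aws_ami.amazon_linux`)
`terraform plan -refresh-only`	Re-reads remote objects and shows drift without proposing changes (replaces the deprecated `terraform refresh`; data sources are re-read on every normal plan anyway)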

Error	Root Cause	Fix
`Error: no matching AMI found`	Filter does not match any AMIs	Broaden the filter or check owners field
`Error: no matching VPC found`	Tag filter does not match	Verify tag names and values on the actual VPC
`Error: multiple VPCs matched`	Filter is too broad	Add more filters to narrow to exactly one result
`(known after apply)` in plan	Data source depends on a resource not yet created	Reference the resource directly instead of using a data source lookup
`Error: error reading Secrets Manager secret`	IAM permissions missing	Add `secretsmanager:GetSecretValue` to the IAM role
