Overview and What You Will Learn
This lab builds the complete AWS infrastructure for a payment API service, the kind of stack Razorpay or PhonePe runs to process millions of transactions. Every piece is written in Terraform, every resource follows production conventions, and the entire stack can be recreated in a new region in minutes with a single terraform apply.
You will build from the network layer up:
- A custom VPC with public and private subnets across two availability zones
- Internet Gateway, NAT Gateway, and route tables for traffic routing
- Security groups with least-privilege ingress and egress rules
- An EC2 instance in the private subnet, bootstrapped via user_data
- An S3 bucket with versioning, encryption, and public access blocked
- An RDS PostgreSQL instance in the private subnet with Multi-AZ option
- IAM role and instance profile so EC2 can call S3, Secrets Manager, and CloudWatch Logs without static credentials
- Consistent naming and tagging using locals and provider default_tags across every resource
- Outputs for the values your application needs to connect to the infrastructure
Why This Matters in Production
Before Terraform, Razorpay's environment setup required a 47-step Confluence document, took two days for a new engineer to follow, and produced slightly different results every time. With Terraform, the entire environment comes up in about eight minutes from terraform apply and is configured identically every time.
When a region goes down, the disaster recovery environment is ready in minutes — not days. When a new engineer joins, their dev environment is ready before their first standup. When a security audit asks "show me every resource and its configuration", the answer is git log on the infrastructure repository.
Core Principles
```
+------------------------------------------+
| VPC (10.0.0.0/16)                        |
|                                          |
| Public Subnets        Private Subnets    |
| 10.0.1.0/24 .2.0/24   10.0.11.0/24       |
| [NAT GW] [ALB]        [EC2] [RDS]        |
|     |                   ^                |
|     | outbound only     |                |
|     v                   |                |
| Internet Gateway      No direct          |
+------------------------------------------+
      |                internet
      v                access
+------------------------------------------+
| Internet (0.0.0.0/0)                     |
+------------------------------------------+
```

Detailed Step-by-Step Practical Lab
The full stack is split across logical files. Create this directory structure:
```bash
mkdir razorpay-payments-infra && cd razorpay-payments-infra

touch versions.tf       # Terraform and provider version constraints
touch provider.tf       # AWS provider configuration with default_tags
touch variables.tf      # all input variables
touch locals.tf         # computed naming and tagging expressions
touch vpc.tf            # VPC, subnets, gateways, route tables
touch security.tf       # security groups
touch compute.tf        # EC2 instances and IAM
touch storage.tf        # S3 buckets
touch database.tf       # RDS instance and parameter group
touch outputs.tf        # values to expose after apply
touch terraform.tfvars  # variable values for this environment
```

Step 1 — Provider and Version Requirements
```hcl
# versions.tf
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```

```hcl
# provider.tf
provider "aws" {
  region = var.aws_region

  # default_tags applies to EVERY resource this provider creates
  # No need to repeat tags = {} on every resource block
  default_tags {
    tags = {
      Project     = var.project
      Environment = var.environment
      ManagedBy   = "terraform"
      Repository  = "github.com/razorpay/payments-infra"
      CostCenter  = "platform-${var.environment}"
    }
  }
}
```

Step 2 — Variables
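The variables file for this step validates only `environment`. As a hedged sketch, the same `validation` mechanism could also guard the network inputs. The two declarations below would replace the plain `vpc_cidr` and `public_subnet_cidrs` declarations rather than sit next to them; the rule of exactly two subnets is an assumption that follows from this lab using two availability zones.

```hcl
# Sketch only: stricter variants of two variables from variables.tf.
variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    # cidrhost() fails on malformed CIDR input; can() converts that failure to false
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid CIDR block, for example 10.0.0.0/16."
  }
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets, one per AZ"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24"]

  validation {
    # Assumption: the lab uses two AZs, so a longer list would have no AZ to map to
    condition     = length(var.public_subnet_cidrs) == 2
    error_message = "Provide exactly two public subnet CIDRs, one for each AZ used in this lab."
  }
}
```

With checks like these, a typo in a `.tfvars` file fails during `terraform plan` instead of surfacing later as an AWS API error.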
```hcl
# variables.tf
variable "aws_region" {
  description = "AWS region — ap-south-1 for Mumbai, primary region for Razorpay"
  type        = string
  default     = "ap-south-1"
}

variable "environment" {
  description = "Deployment environment — controls resource sizing and naming"
  type        = string
  default     = "dev"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "project" {
  description = "Project name prefix — used in all resource names and tags"
  type        = string
  default     = "razorpay-payments"
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC — /16 gives 65,536 addresses"
  type        = string
  default     = "10.0.0.0/16"
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets — one per AZ"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24"]
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets — one per AZ, for EC2 and RDS"
  type        = list(string)
  default     = ["10.0.11.0/24", "10.0.12.0/24"]
}

variable "ec2_instance_type" {
  description = "EC2 instance type for the application server"
  type        = string
  default     = "t3.medium"
}

variable "rds_instance_class" {
  description = "RDS instance class for the PostgreSQL database"
  type        = string
  default     = "db.t3.medium"
}

variable "rds_allocated_storage" {
  description = "RDS allocated storage in GB"
  type        = number
  default     = 20
}

variable "db_master_username" {
  description = "Master username for the RDS instance"
  type        = string
  default     = "payments_admin"
}

variable "db_master_password" {
  description = "Master password for the RDS instance — pass via TF_VAR_db_master_password"
  type        = string
  sensitive   = true
}

variable "enable_multi_az" {
  description = "Enable Multi-AZ for RDS — true in prod for high availability"
  type        = bool
  default     = false
}
```

```hcl
# terraform.tfvars — dev environment values
aws_region            = "ap-south-1"
environment           = "dev"
project               = "razorpay-payments"
vpc_cidr              = "10.0.0.0/16"
public_subnet_cidrs   = ["10.0.1.0/24", "10.0.2.0/24"]
private_subnet_cidrs  = ["10.0.11.0/24", "10.0.12.0/24"]
ec2_instance_type     = "t3.micro"
rds_instance_class    = "db.t3.micro"
rds_allocated_storage = 20
enable_multi_az       = false
# db_master_password — pass via: export TF_VAR_db_master_password=...
```

Step 3 — Locals
```hcl
# locals.tf
data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_caller_identity" "current" {}

data "aws_region" "current" {}

data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

locals {
  name_prefix = "${var.project}-${var.environment}" # razorpay-payments-dev

  # Use only the first two AZs — enough for HA without over-engineering
  azs = slice(data.aws_availability_zones.available.names, 0, 2)

  # S3 bucket name must be globally unique — include account ID
  s3_data_bucket_name = "${local.name_prefix}-data-${data.aws_caller_identity.current.account_id}"
}
```

Step 4 — VPC, Subnets, Gateways, Route Tables
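The subnet CIDR lists in this lab are passed in as variables. One possible alternative, shown only as a sketch, derives them from `var.vpc_cidr` with the built-in `cidrsubnet` function so the subnets always stay inside the VPC range. The two local names below are hypothetical and are not used elsewhere in the lab.

```hcl
# Sketch only: derive /24 subnets from the /16 VPC block.
# cidrsubnet(prefix, newbits, netnum): 8 new bits turn a /16 into a /24,
# and netnum selects which /24 inside the VPC range.
locals {
  # With vpc_cidr = "10.0.0.0/16" this gives 10.0.1.0/24 and 10.0.2.0/24
  derived_public_subnet_cidrs = [for i in range(2) : cidrsubnet(var.vpc_cidr, 8, i + 1)]

  # With vpc_cidr = "10.0.0.0/16" this gives 10.0.11.0/24 and 10.0.12.0/24
  derived_private_subnet_cidrs = [for i in range(2) : cidrsubnet(var.vpc_cidr, 8, i + 11)]
}
```

The explicit lists in `terraform.tfvars` are easier to read; the derived form is mainly useful when the same configuration is deployed with a different `vpc_cidr` per environment.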
```hcl
# vpc.tf

# ── VPC ──────────────────────────────────────────────────────────
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true # required for RDS endpoint DNS resolution
  enable_dns_hostnames = true # EC2 instances get DNS hostnames

  tags = { Name = "${local.name_prefix}-vpc" }
}

# ── Public Subnets (NAT GW and ALB live here) ────────────────────
resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = local.azs[count.index]
  map_public_ip_on_launch = true # EC2 in public subnet gets a public IP

  tags = {
    Name = "${local.name_prefix}-public-${local.azs[count.index]}"
    Tier = "public"
  }
}

# ── Private Subnets (EC2 app servers and RDS) ────────────────────
resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = local.azs[count.index]
  # map_public_ip_on_launch = false (default) — private subnet, no public IPs

  tags = {
    Name = "${local.name_prefix}-private-${local.azs[count.index]}"
    Tier = "private"
  }
}

# ── Internet Gateway (public subnet → internet) ──────────────────
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = { Name = "${local.name_prefix}-igw" }
}

# ── Elastic IP for NAT Gateway ───────────────────────────────────
# NAT Gateway needs a static public IP
resource "aws_eip" "nat" {
  domain     = "vpc"                       # allocate in VPC address pool
  depends_on = [aws_internet_gateway.main] # IGW must exist before EIP can be used

  tags = { Name = "${local.name_prefix}-nat-eip" }
}

# ── NAT Gateway (private subnet → internet, one-way) ─────────────
# Sits in the PUBLIC subnet — private subnet route table points to it
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id # NAT GW goes in the first public subnet
  depends_on    = [aws_internet_gateway.main]

  tags = { Name = "${local.name_prefix}-nat-gw" }
}

# ── Public Route Table ───────────────────────────────────────────
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"                  # all traffic
    gateway_id = aws_internet_gateway.main.id # goes through IGW
  }

  tags = { Name = "${local.name_prefix}-public-rt" }
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# ── Private Route Table ──────────────────────────────────────────
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"             # outbound internet
    nat_gateway_id = aws_nat_gateway.main.id # via NAT GW, not IGW
  }

  tags = { Name = "${local.name_prefix}-private-rt" }
}

resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
```

Step 5 — Security Groups
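The security groups in this step leave egress open to 0.0.0.0/0. For a stricter reading of least privilege, one option is a load balancer group whose only outbound rule targets the app servers on port 8080. The sketch below assumes AWS provider 5.x and uses the standalone rule resources; the name `alb_strict` is hypothetical, and the group is written as a separate resource so that it does not mix standalone rules with the inline `ingress`/`egress` blocks used in `security.tf`.

```hcl
# Sketch only: an ALB security group with narrowly scoped rules.
resource "aws_security_group" "alb_strict" {
  name        = "${local.name_prefix}-alb-strict-sg"
  description = "ALB: HTTPS in, port 8080 out to app servers only"
  vpc_id      = aws_vpc.main.id

  tags = { Name = "${local.name_prefix}-alb-strict-sg" }
}

resource "aws_vpc_security_group_ingress_rule" "alb_strict_https" {
  security_group_id = aws_security_group.alb_strict.id
  description       = "HTTPS from internet"
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}

resource "aws_vpc_security_group_egress_rule" "alb_strict_to_app" {
  security_group_id            = aws_security_group.alb_strict.id
  description                  = "Forward to app servers on 8080 only"
  referenced_security_group_id = aws_security_group.app.id
  from_port                    = 8080
  to_port                      = 8080
  ip_protocol                  = "tcp"
}
```

To use it, the app group's ingress rule would need to reference `aws_security_group.alb_strict.id` instead of the ALB group defined in this step.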
```hcl
# security.tf
# Rule descriptions use plain ASCII hyphens: EC2 rejects characters outside its allowed set.

# ── Application Load Balancer security group ─────────────────────
resource "aws_security_group" "alb" {
  name        = "${local.name_prefix}-alb-sg"
  description = "ALB: HTTPS from internet, all outbound to app"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "HTTPS from internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTP - redirect to HTTPS"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "All outbound - to app servers on port 8080"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "${local.name_prefix}-alb-sg" }
}

# ── Application server security group ────────────────────────────
resource "aws_security_group" "app" {
  name        = "${local.name_prefix}-app-sg"
  description = "App server: inbound from ALB only, outbound to RDS and internet"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "Application traffic from ALB only"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id] # reference by SG, not CIDR
  }

  egress {
    description = "All outbound - to RDS, S3 endpoints, internet"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "${local.name_prefix}-app-sg" }
}

# ── RDS security group ───────────────────────────────────────────
resource "aws_security_group" "rds" {
  name        = "${local.name_prefix}-rds-sg"
  description = "RDS: PostgreSQL inbound from app servers only"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "PostgreSQL from app servers only - not from internet"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id] # app SG only, no CIDR
  }

  # No egress rule needed for RDS — RDS does not initiate connections

  tags = { Name = "${local.name_prefix}-rds-sg" }
}
```

Step 6 — IAM Role for EC2
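The policy document in this step grants `s3:ListBucket` and the object actions in one statement that covers both the bucket ARN and the object ARN. A slightly tighter variant, sketched here with the hypothetical name `app_s3_split`, separates the bucket-level action from the object-level actions so each statement lists only the resource type its actions apply to. The effective permissions are the same.

```hcl
# Sketch only: the S3 statement split by resource type.
data "aws_iam_policy_document" "app_s3_split" {
  # Bucket-level action: applies to the bucket ARN itself
  statement {
    sid       = "ListDataBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.app_data.arn]
  }

  # Object-level actions: apply to keys inside the bucket
  statement {
    sid       = "ReadWriteDataObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["${aws_s3_bucket.app_data.arn}/*"]
  }
}
```

Splitting the statements makes a later change, such as dropping `s3:DeleteObject`, easier to review.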
```hcl
# compute.tf (IAM section)

# Trust policy: allow EC2 service to assume this role
data "aws_iam_policy_document" "ec2_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

# The IAM role — EC2 instances will use this for all AWS API calls
resource "aws_iam_role" "app" {
  name               = "${local.name_prefix}-ec2-role"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

# Policy document: what the EC2 instance is allowed to do
data "aws_iam_policy_document" "app_permissions" {
  # S3 access — read and write to the app data bucket only
  statement {
    sid     = "S3DataBucket"
    effect  = "Allow"
    actions = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
    resources = [
      aws_s3_bucket.app_data.arn,
      "${aws_s3_bucket.app_data.arn}/*"
    ]
  }

  # Secrets Manager — read the DB password
  statement {
    sid     = "SecretsManagerRead"
    effect  = "Allow"
    actions = ["secretsmanager:GetSecretValue"]
    resources = [
      "arn:aws:secretsmanager:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:secret:${local.name_prefix}/*"
    ]
  }

  # CloudWatch Logs — write application logs
  statement {
    sid       = "CloudWatchLogs"
    effect    = "Allow"
    actions   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["arn:aws:logs:*:*:*"]
  }
}

# Inline policy attached to the role
resource "aws_iam_role_policy" "app" {
  name   = "${local.name_prefix}-app-policy"
  role   = aws_iam_role.app.id
  policy = data.aws_iam_policy_document.app_permissions.json
}

# SSM — attach managed policy for Systems Manager access (no bastion needed)
resource "aws_iam_role_policy_attachment" "ssm" {
  role       = aws_iam_role.app.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Instance profile — wraps the role for EC2 to use
resource "aws_iam_instance_profile" "app" {
  name = "${local.name_prefix}-ec2-profile"
  role = aws_iam_role.app.name
}
```

Step 7 — EC2 Instance
```hcl
# compute.tf (EC2 section)

resource "aws_instance" "app" {
  ami                    = data.aws_ami.amazon_linux.id
  instance_type          = var.ec2_instance_type
  subnet_id              = aws_subnet.private[0].id # private subnet
  vpc_security_group_ids = [aws_security_group.app.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name

  # user_data: shell script that runs on first boot
  # Use <<-EOF (with hyphen) to allow indentation
  user_data = <<-EOF
    #!/bin/bash
    set -e # exit if any command fails

    # Update all packages
    yum update -y

    # Install CloudWatch agent for log shipping
    yum install -y amazon-cloudwatch-agent

    # Install the application
    # (Replace with your actual deployment commands)
    mkdir -p /opt/app # create the directory first, otherwise the writes below fail under set -e
    echo "Environment: ${var.environment}" > /opt/app/config.env
    echo "Project: ${var.project}" >> /opt/app/config.env

    # Start the application
    systemctl start app || true
  EOF

  # Require IMDSv2 — prevents SSRF attacks that steal instance metadata
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required" # IMDSv2 only — blocks IMDSv1
    http_put_response_hop_limit = 1
  }

  root_block_device {
    volume_type           = "gp3"
    volume_size           = 30   # GB
    encrypted             = true # always encrypt root volume
    delete_on_termination = true
  }

  tags = { Name = "${local.name_prefix}-app-server" }
}
```

Step 8 — S3 Bucket
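The bucket in this step is encrypted at rest and blocked from public access. A common companion control, sketched here as an optional addition, is a bucket policy that denies any request not made over TLS. The resource and data source names ending in `tls_only` are hypothetical.

```hcl
# Sketch only: deny all S3 requests to the data bucket that do not use HTTPS.
data "aws_iam_policy_document" "app_data_tls_only" {
  statement {
    sid     = "DenyInsecureTransport"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      aws_s3_bucket.app_data.arn,
      "${aws_s3_bucket.app_data.arn}/*",
    ]

    # Applies to every caller, including principals in this account
    principals {
      type        = "*"
      identifiers = ["*"]
    }

    # aws:SecureTransport is "false" when the request arrives over plain HTTP
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

resource "aws_s3_bucket_policy" "app_data_tls_only" {
  bucket = aws_s3_bucket.app_data.id
  policy = data.aws_iam_policy_document.app_data_tls_only.json
}
```

Because the statement only denies, it does not grant public access and is compatible with `block_public_policy = true`.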
```hcl
# storage.tf

resource "aws_s3_bucket" "app_data" {
  bucket = local.s3_data_bucket_name # globally unique via account ID suffix

  tags = { Name = "${local.name_prefix}-data" }
}

# Block all public access — application data must never be public
resource "aws_s3_bucket_public_access_block" "app_data" {
  bucket                  = aws_s3_bucket.app_data.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Versioning — recover deleted or overwritten objects
resource "aws_s3_bucket_versioning" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  versioning_configuration {
    status = var.environment == "prod" ? "Enabled" : "Suspended"
  }
}

# Server-side encryption — all objects encrypted at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # SSE-S3 — free, no KMS charges
    }
    bucket_key_enabled = true # reduces KMS API calls if you upgrade to SSE-KMS
  }
}

# Lifecycle — move old objects to cheaper storage, delete very old ones
resource "aws_s3_bucket_lifecycle_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    id     = "transition-and-expire"
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object in the bucket

    transition {
      days          = 30
      storage_class = "STANDARD_IA" # cheaper storage after 30 days
    }

    transition {
      days          = 90
      storage_class = "GLACIER" # archive after 90 days
    }

    expiration {
      days = 365 # delete after one year — adjust to your data retention policy
    }
  }
}
```

Step 9 — RDS PostgreSQL
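The database in this step takes its master password from `var.db_master_password`. An alternative is to read it from AWS Secrets Manager with a data source. The sketch below assumes a plain-string secret named `<name_prefix>/db-master-password` already exists; that secret name and the local `db_master_password` are hypothetical.

```hcl
# Sketch only: look up the master password instead of passing it as a variable.
data "aws_secretsmanager_secret_version" "db_master" {
  # Hypothetical secret name; it must be created outside this configuration
  secret_id = "${local.name_prefix}/db-master-password"
}

locals {
  # secret_string holds the raw secret value (assumed to be a plain string, not JSON)
  db_master_password = data.aws_secretsmanager_secret_version.db_master.secret_string
}

# In aws_db_instance.main the password argument would then read:
#   password = local.db_master_password
```

Values read through a data source are still written to the Terraform state, so the state file needs the same protection as the secret itself.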
```hcl
# database.tf

# Subnet group — tells RDS which subnets it can create its ENI in
resource "aws_db_subnet_group" "main" {
  name        = "${local.name_prefix}-db-subnet-group"
  description = "Subnet group for ${local.name_prefix} PostgreSQL"
  subnet_ids  = aws_subnet.private[*].id # all private subnets

  tags = { Name = "${local.name_prefix}-db-subnet-group" }
}

# Parameter group — PostgreSQL configuration (logging, timeouts, etc.)
resource "aws_db_parameter_group" "postgres14" {
  name        = "${local.name_prefix}-pg14"
  family      = "postgres14"
  description = "Custom PostgreSQL 14 parameters for ${local.name_prefix}"

  parameter {
    name  = "log_statement"
    value = "ddl" # log DDL statements (CREATE, ALTER, DROP)
  }

  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # log queries taking over 1 second — catch slow queries
  }

  parameter {
    name  = "log_connections"
    value = "1" # log new database connections
  }

  tags = { Name = "${local.name_prefix}-pg14-params" }
}

# RDS PostgreSQL instance
resource "aws_db_instance" "main" {
  identifier = "${local.name_prefix}-postgres"

  # Engine
  engine               = "postgres"
  engine_version       = "14.9"
  parameter_group_name = aws_db_parameter_group.postgres14.name

  # Sizing
  instance_class        = var.rds_instance_class
  allocated_storage     = var.rds_allocated_storage
  storage_type          = "gp3"                         # faster and cheaper than gp2
  max_allocated_storage = var.rds_allocated_storage * 3 # auto-scale up to 3x

  # Database
  db_name  = "payments" # initial database to create
  username = var.db_master_username
  password = var.db_master_password # marked sensitive in variable block

  # Networking
  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  publicly_accessible    = false # private subnet only — never public

  # High Availability
  multi_az = var.enable_multi_az # true in prod, false in dev

  # Backup
  backup_retention_period = var.environment == "prod" ? 30 : 3
  backup_window           = "03:00-04:00"         # 3-4am UTC (8:30-9:30am IST)
  maintenance_window      = "sun:04:00-sun:05:00" # Sunday 4-5am UTC

  # Security
  storage_encrypted   = true # always encrypt data at rest
  deletion_protection = var.environment == "prod"

  # Snapshot on destroy — safety net
  skip_final_snapshot       = var.environment != "prod"
  final_snapshot_identifier = "${local.name_prefix}-final-snapshot"

  # Performance Insights
  performance_insights_enabled          = true
  performance_insights_retention_period = 7 # days

  lifecycle {
    # Ignore password changes after creation so Terraform never overwrites a rotated password
    # Rotate the password manually via RDS console or Secrets Manager
    ignore_changes = [password]
  }

  tags = { Name = "${local.name_prefix}-postgres" }
}
```

Step 10 — Outputs
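The outputs in this step expose the database host, port, and name as separate values. If the application prefers a single setting, a combined connection URI could be assembled from the same attributes. This is only a sketch; the output name `rds_connection_uri` is hypothetical, and the password is deliberately left out so that it never appears in `terraform output`.

```hcl
# Sketch only: one output that combines user, host, port, and database name.
output "rds_connection_uri" {
  description = "PostgreSQL URI without the password; the application supplies that at runtime"
  value = format(
    "postgresql://%s@%s:%d/%s",
    aws_db_instance.main.username,
    aws_db_instance.main.address,
    aws_db_instance.main.port,
    aws_db_instance.main.db_name,
  )
}
```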
```hcl
# outputs.tf

output "vpc_id" {
  description = "VPC ID — use in other Terraform configurations via remote state"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "Public subnet IDs — for ALB and NAT Gateway"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "Private subnet IDs — for EC2 and RDS"
  value       = aws_subnet.private[*].id
}

output "app_security_group_id" {
  description = "Security group ID for application servers"
  value       = aws_security_group.app.id
}

output "instance_id" {
  description = "EC2 instance ID for the application server"
  value       = aws_instance.app.id
}

output "instance_private_ip" {
  description = "Private IP of the application server — for health checks"
  value       = aws_instance.app.private_ip
}

output "s3_bucket_name" {
  description = "S3 bucket name — set as APP_DATA_BUCKET env var in your application"
  value       = aws_s3_bucket.app_data.id
}

output "s3_bucket_arn" {
  description = "S3 bucket ARN — use in IAM policy resource fields"
  value       = aws_s3_bucket.app_data.arn
}

output "rds_endpoint" {
  description = "RDS endpoint hostname — set as DB_HOST env var in your application"
  value       = aws_db_instance.main.address
}

output "rds_port" {
  description = "RDS port — PostgreSQL default is 5432"
  value       = aws_db_instance.main.port
}

output "rds_database_name" {
  description = "Database name — set as DB_NAME env var in your application"
  value       = aws_db_instance.main.db_name
}
```

Step 11 — Apply the Full Stack
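The configuration in this lab declares no backend, so `terraform init` keeps state in a local `terraform.tfstate` file. For a shared team setup, a remote backend is the usual next step. The block below is a sketch assuming an S3 bucket and a DynamoDB lock table that already exist; both names are hypothetical.

```hcl
# Sketch only: remote state in S3 with locking.
# Backend blocks cannot reference variables or locals, so values are literal.
terraform {
  backend "s3" {
    bucket         = "razorpay-payments-tfstate"  # hypothetical, must already exist
    key            = "payments-infra/dev/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "razorpay-payments-tf-locks" # hypothetical lock table
    encrypt        = true                         # encrypt the state object at rest
  }
}
```

Using a distinct `key` per environment keeps the dev and prod state files separate.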
```bash
# Export the database password before running apply
# RDS rejects master passwords containing /, @, ", or spaces
export TF_VAR_db_master_password="Razorpay2024Secure-ChangeMe"

# Initialise providers and backend
terraform init

# Validate configuration — catches syntax errors before any API calls
terraform validate

# Preview the full stack — read this carefully before applying
terraform plan -out=tfplan.binary

# Apply — creates 30 resources in the right dependency order
terraform apply tfplan.binary

# Expected output (abridged):
# aws_vpc.main: Creating...
# aws_vpc.main: Creation complete after 3s [id=vpc-0a1b2c3d]
# aws_subnet.public[0]: Creating...
# aws_subnet.public[1]: Creating...
# aws_subnet.private[0]: Creating...
# ... (parallel where dependencies allow)
# aws_db_instance.main: Still creating... [3m0s elapsed]
# aws_db_instance.main: Creation complete after 4m12s
#
# Apply complete! Resources: 30 added, 0 changed, 0 destroyed.
#
# Outputs:
# rds_endpoint = "razorpay-payments-dev-postgres.abc123.ap-south-1.rds.amazonaws.com"
# s3_bucket_name = "razorpay-payments-dev-data-123456789012"
# instance_id = "i-0a1b2c3d4e5f"
# vpc_id = "vpc-0a1b2c3d"
```

Step 12 — Apply for Production
The same configuration, different values:
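```bash
# prod.tfvars
cat > prod.tfvars << 'EOF'
environment           = "prod"
ec2_instance_type     = "t3.large"
rds_instance_class    = "db.r6g.large"
rds_allocated_storage = 500
enable_multi_az       = true
EOF

export TF_VAR_db_master_password="ProductionSecretHere"

# Keep prod state apart from dev: applied to the dev state, this would
# rename and replace the dev resources. A separate workspace is one option.
terraform workspace new prod

terraform apply -var-file=prod.tfvars
```

Production Best Practices and Common Pitfalls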
- Never put the RDS password in `terraform.tfvars`. Use `export TF_VAR_db_master_password=...` before running apply, or read from AWS Secrets Manager using a data source. The `.tfvars` file goes into Git; the password must not.
- Use `lifecycle { ignore_changes = [password] }` on RDS. Once the password is rotated through the console or Secrets Manager rotation, Terraform should no longer manage it. Ignoring the attribute means a later change to the input variable cannot overwrite the rotated value.
- Use `deletion_protection = true` on production RDS. This prevents `terraform destroy` from deleting the database. Even if an engineer runs destroy by mistake in production, the database will be protected.
- Use `skip_final_snapshot = false` on production RDS. When you eventually delete the database intentionally, you want a final snapshot as a safety net. `final_snapshot_identifier` must also be set.
- Set `http_tokens = "required"` on EC2 instances. This enforces IMDSv2, which prevents Server-Side Request Forgery (SSRF) attacks from stealing the instance's IAM credentials via the metadata service.
- Enable storage encryption on both EC2 root volumes and RDS. `storage_encrypted = true` on RDS and `encrypted = true` in `root_block_device` are not defaults; you must set them explicitly.
- Use `gp3` for both EC2 and RDS storage. `gp3` is cheaper and faster than `gp2` and allows you to independently set IOPS and throughput. There is no reason to use `gp2` for new resources.
- One NAT Gateway per AZ for production. This lab uses a single NAT Gateway to keep costs down. In production, use one NAT Gateway per AZ so a single AZ failure does not take down outbound internet access for the other AZ.
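The one-NAT-Gateway-per-AZ recommendation can be sketched with `count`, mirroring how the subnets are declared. The resource names ending in `per_az` are hypothetical. These resources would replace `aws_eip.nat`, `aws_nat_gateway.main`, `aws_route_table.private`, and its association rather than sit alongside them, because a subnet can be associated with only one route table.

```hcl
# Sketch only: one EIP, NAT Gateway, and private route table per AZ.
resource "aws_eip" "nat_per_az" {
  count      = length(aws_subnet.public)
  domain     = "vpc"
  depends_on = [aws_internet_gateway.main]

  tags = { Name = "${local.name_prefix}-nat-eip-${local.azs[count.index]}" }
}

resource "aws_nat_gateway" "per_az" {
  count         = length(aws_subnet.public)
  allocation_id = aws_eip.nat_per_az[count.index].id
  subnet_id     = aws_subnet.public[count.index].id # the public subnet in the same AZ
  depends_on    = [aws_internet_gateway.main]

  tags = { Name = "${local.name_prefix}-nat-gw-${local.azs[count.index]}" }
}

resource "aws_route_table" "private_per_az" {
  count  = length(aws_subnet.private)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.per_az[count.index].id # stay inside the AZ
  }

  tags = { Name = "${local.name_prefix}-private-rt-${local.azs[count.index]}" }
}

resource "aws_route_table_association" "private_per_az" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private_per_az[count.index].id
}
```

The sketch relies on the public and private subnet lists having the same length and AZ order, which holds for the defaults in this lab.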
Quick Reference and Troubleshooting Commands
| Command | What It Does |
|---|---|
| `terraform plan -target=aws_vpc.main` | Plan only the VPC resource |
| `terraform apply -var-file=prod.tfvars` | Apply with production variable values |
| `terraform state list` | List every resource in state (30 managed resources in this lab) |
| `terraform state show aws_db_instance.main` | Show all RDS attributes including endpoint |
| `terraform output rds_endpoint` | Read the RDS endpoint after apply |
| `terraform destroy -target=aws_instance.app` | Destroy only the EC2 instance |

| Error | Root Cause | Fix |
|---|---|---|
| `InvalidSubnetID.NotFound` | Subnet created in different AZ than expected | Check `availability_zone` argument on subnet |
| `DBSubnetGroupDoesNotCoverEnoughAZs` | RDS subnet group needs subnets in ≥ 2 AZs | Add subnets from at least two AZs to the subnet group |
| `InvalidParameterValue: Multi-AZ is not supported` | Wrong instance class for Multi-AZ | Check RDS instance class supports Multi-AZ |
| `Error: creating S3 bucket: BucketAlreadyOwnedByYou` | Bucket name already exists in your account | Use the account ID suffix pattern: `local.s3_data_bucket_name` |
| NAT Gateway creation fails | EIP quota exceeded | Request EIP quota increase or release unused EIPs |
| RDS takes 5+ minutes to create | Normal; RDS provisioning is slow | Wait; add `-parallelism=20` to speed up other resources |