Terraform as the IaC Standard Across Hundreds of AWS Accounts
How we standardized Terraform across hundreds of AWS accounts with shared modules, remote state, and workspace conventions
When you manage a handful of AWS accounts, Terraform is straightforward. Write your HCL, run terraform apply, move on. When you manage hundreds of accounts across multiple business units, regions, and environments, Terraform becomes an organizational challenge as much as a technical one.
We have spent the last several months establishing Terraform as the infrastructure-as-code standard across the enterprise. This is what we have learned about making it work at scale.
The Problem With Uncoordinated Terraform
Before we established standards, teams were using Terraform independently. Each team had their own module structure, their own state management approach, their own variable naming conventions. Some teams stored state locally. Others used S3 backends but with inconsistent bucket naming and no locking. A few teams were not using Terraform at all, managing infrastructure through the console or with CloudFormation.
The result was predictable:
- No code reuse: Five teams building VPCs meant five different VPC modules with five different opinions about subnet layouts.
- State management chaos: Lost state files, state corruption from concurrent applies, and no visibility into what Terraform managed across the organization.
- Drift detection impossible: Without consistent state management, we could not tell which resources were managed by Terraform and which were created manually.
- Onboarding friction: Moving between teams meant learning a completely different Terraform workflow.
The Module Library
The foundation of our approach is a centralized module library. We maintain a set of blessed Terraform modules that encode our organizational standards and best practices.
Each module represents an infrastructure pattern:
module "vpc" {
  source = "git::https://github.internal/cloud-platform/terraform-modules.git//vpc?ref=v2.1.0"

  environment          = var.environment
  cidr_block           = var.vpc_cidr
  azs                  = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets      = var.private_subnet_cidrs
  public_subnets       = var.public_subnet_cidrs
  enable_nat_gateway   = true
  single_nat_gateway   = var.environment != "production"
  enable_flow_logs     = true
  flow_log_destination = module.logging.log_bucket_arn

  tags = local.standard_tags
}
The VPC module encodes decisions that should be consistent everywhere: flow logs are always enabled, subnets follow a predictable layout, tagging is standardized. Individual teams can customize parameters, but the structural decisions are baked in.
We version the modules with semantic versioning. Breaking changes require a major version bump, and teams can pin to specific versions to avoid unexpected changes. This gives us a mechanism for rolling out improvements without forcing every consumer to update simultaneously.
State Management Architecture
State management at this scale requires deliberate architecture. Our approach:
One S3 bucket per account for state storage. Each AWS account has a dedicated state bucket provisioned during account creation. The bucket has versioning enabled, server-side encryption with KMS, and a lifecycle policy that retains previous versions for 90 days.
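Provisioned during account creation, the bucket configuration looks roughly like the following sketch. It assumes the AWS provider v4+ split bucket resources; the resource names and the KMS key reference are illustrative, not our actual module.

```hcl
# Illustrative per-account state bucket; names and the KMS key are assumptions.
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "state" {
  bucket = "terraform-state-${data.aws_caller_identity.current.account_id}"
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn # assumed key resource
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    id     = "expire-noncurrent-versions"
    status = "Enabled"

    filter {}

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
```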
DynamoDB for state locking. Every state backend configuration includes a DynamoDB table for locking. This prevents concurrent applies from corrupting state, which is a real risk when multiple engineers or CI/CD pipelines operate against the same state.
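The lock table itself is tiny: the S3 backend requires only a string hash key named LockID. The table name matches the backend configuration below; the billing mode is an assumption on our part, since lock traffic is light and bursty.

```hcl
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST" # assumption; lock traffic is light
  hash_key     = "LockID"          # the S3 backend requires exactly this attribute

  attribute {
    name = "LockID"
    type = "S"
  }
}
```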
terraform {
  backend "s3" {
    bucket         = "terraform-state-123456789012"
    key            = "platform/vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
    kms_key_id     = "alias/terraform-state"
  }
}
State file organization follows a consistent key pattern. The S3 key structure mirrors the logical organization: {team}/{component}/terraform.tfstate. This makes it easy to find state files and understand what they manage.
Workspace Strategy
We evaluated Terraform workspaces for environment separation and decided against them for most use cases. Workspaces share the same configuration and backend, which means a terraform destroy in the wrong workspace can have catastrophic consequences. Instead, we use separate directories and separate state files for each environment.
infrastructure/
  production/
    vpc/
      main.tf
      backend.tf
      variables.tf
    ecs/
      main.tf
      backend.tf
      variables.tf
  staging/
    vpc/
      ...
    ecs/
      ...
The trade-off is some code duplication between environments, but the isolation benefit is worth it. A mistake in the staging directory cannot affect production state. We mitigate the duplication with shared variable files and by keeping environment-specific configuration minimal.
CI/CD for Infrastructure
Every Terraform change goes through a pipeline:
- Lint and validate: terraform fmt -check and terraform validate catch syntax issues early.
- Plan: terraform plan runs against the target environment, and the output is saved as an artifact.
- Review: The plan output is posted to the pull request for peer review. Reviewers can see exactly what resources will be created, modified, or destroyed.
- Apply: After approval, terraform apply runs using the saved plan file, ensuring that what was reviewed is what gets applied.
The plan-and-apply separation is critical. Without it, the infrastructure could change between review and apply, leading to unexpected results. The saved plan file guarantees consistency.
Policy Enforcement
We use Sentinel (HashiCorp's policy-as-code framework) to enforce organizational policies at plan time:
- All resources must have standard tags (team, environment, cost-center).
- Security groups cannot have unrestricted ingress from 0.0.0.0/0.
- S3 buckets must have encryption enabled.
- RDS instances must have multi-AZ enabled in production.
- EC2 instances must use approved AMIs.
Policies that fail block the apply. This shifts compliance enforcement left, catching violations before they reach production rather than detecting them after the fact with audit tools.
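Sentinel evaluates these rules against the plan output. Inside the modules themselves, the tagging rule can also be expressed as a Terraform variable validation, a complementary guardrail sketched below (this is not our Sentinel policy, just the same check in plain Terraform):

```hcl
variable "tags" {
  type        = map(string)
  description = "Standard resource tags."

  validation {
    condition = alltrue([
      for required in ["team", "environment", "cost-center"] :
      contains(keys(var.tags), required)
    ])
    error_message = "Tags must include team, environment, and cost-center."
  }
}
```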
Cross-Account References
Applications frequently need to reference resources in other accounts: shared VPCs, centralized logging buckets, security services. We handle this with Terraform remote state data sources and SSM Parameter Store.
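A remote state read against the layout described earlier looks like this; the bucket and key reuse the backend example above, and the output name is illustrative.

```hcl
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "terraform-state-123456789012"
    key    = "platform/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Consume an output, e.g. data.terraform_remote_state.network.outputs.vpc_id
```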
For infrastructure managed by the platform team, we publish outputs to SSM Parameter Store in a well-known path structure:
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/platform/networking/vpc-id"
  type  = "String"
  value = module.vpc.vpc_id
}
Application teams read these parameters:
data "aws_ssm_parameter" "vpc_id" {
  name = "/platform/networking/vpc-id"
}
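The value then feeds into the team's own resources, for example (an illustrative security group, not a real one of ours):

```hcl
resource "aws_security_group" "app" {
  name   = "app" # illustrative
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```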
This provides loose coupling between infrastructure layers. The platform team can change VPC implementation details without breaking application configurations, as long as the parameter values remain valid.
Lessons at Scale
Module versioning discipline matters more than module design. A well-designed module with poor versioning practices will cause more pain than a mediocre module with strict semantic versioning. Teams need to trust that updating a module will not break their infrastructure unexpectedly.
State file granularity is a judgment call. Too few state files means long plan/apply cycles and a large blast radius. Too many means managing hundreds of backend configurations and cross-state references. We aim for one state file per logical component per environment, which typically means 5 to 15 state files per account.
Training is not optional. Terraform has a learning curve, and enterprise conventions add another layer of complexity. We run monthly workshops, maintain a comprehensive internal wiki, and have a dedicated Slack channel for Terraform questions. The investment in training pays for itself in reduced misconfigurations.
Import existing resources before they become orphans. Resources created through the console or CLI should be imported into Terraform management as soon as possible. The longer they exist outside of code, the harder they are to track and the more likely they are to drift from organizational standards.
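Since Terraform 1.5, imports can be declared in configuration, which lets adoption of orphaned resources flow through the same plan/review pipeline as any other change. The resource address and bucket name here are illustrative.

```hcl
# Adopt a console-created bucket into Terraform management (names illustrative).
import {
  to = aws_s3_bucket.legacy_reports
  id = "legacy-reports-bucket"
}

resource "aws_s3_bucket" "legacy_reports" {
  bucket = "legacy-reports-bucket"
}
```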
Five months in, we have over 200 accounts using the standardized Terraform workflow. The module library has grown to cover VPCs, ECS clusters, RDS instances, S3 configurations, IAM roles, and CloudWatch alarms. It is not perfect, but it is consistent, and at enterprise scale, consistency is worth more than perfection.