8 min read

VPC Networking at Enterprise Scale

Designing VPC topologies, Transit Gateway patterns, peering strategies, and security group architectures for hundreds of AWS accounts

Networking is the invisible infrastructure that makes everything else possible. When it works, nobody thinks about it. When it breaks, nothing else matters. At a major entertainment company operating hundreds of AWS accounts, VPC networking is one of the most critical architectural decisions we make, and one of the hardest to change after the fact.

I have spent the better part of this year designing and implementing VPC networking patterns that scale across our organization. Here is what enterprise-scale AWS networking actually looks like.

The Multi-Account Reality

We operate hundreds of AWS accounts. Each business unit has its own set of accounts, typically organized as development, staging, and production. Shared services (logging, monitoring, security tooling) live in dedicated accounts. The networking account manages Transit Gateway, Direct Connect, and DNS.

This account structure provides blast radius isolation: a misconfiguration in one business unit's development account cannot affect another unit's production environment. But it also creates a connectivity challenge. Applications need to communicate across accounts, and that communication needs to be controlled, secure, and performant.

VPC Design Principles

Every VPC in our environment follows a standard design:

CIDR allocation is centralized. We maintain a master IP address management (IPAM) spreadsheet (soon to be replaced by a proper IPAM tool) that tracks every VPC CIDR block across every account. Overlapping CIDRs make peering and routing impossible, so preventing overlap is a hard requirement. We allocate /16 blocks per business unit and subdivide into /20 or /22 blocks per VPC.
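The subdivision scheme maps naturally onto Terraform's built-in cidrsubnet function. A minimal sketch, assuming a hypothetical business unit /16 (the authoritative allocation lives in the IPAM tracker, not in code):

locals {
  bu_cidr = "10.100.0.0/16" # hypothetical business unit allocation

  # Carve sequential /22 VPC blocks out of the /16 (16 + 6 new bits = /22)
  vpc_cidrs = [for i in range(4) : cidrsubnet(local.bu_cidr, 6, i)]
  # => ["10.100.0.0/22", "10.100.4.0/22", "10.100.8.0/22", "10.100.12.0/22"]
}

Generating blocks this way guarantees they are non-overlapping by construction, which is exactly the property the central tracker exists to enforce.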

Three-tier subnet architecture. Every VPC has public, private, and data subnets across three availability zones:

VPC: 10.100.0.0/20

Public Subnets (ALBs, NAT Gateways):
  10.100.0.0/24  (AZ-a)
  10.100.1.0/24  (AZ-b)
  10.100.2.0/24  (AZ-c)

Private Subnets (Application Servers, ECS Tasks):
  10.100.4.0/22  (AZ-a)
  10.100.8.0/22  (AZ-b)
  10.100.12.0/22 (AZ-c)

Data Subnets (RDS, ElastiCache):
  10.100.3.0/26  (AZ-a)
  10.100.3.64/26 (AZ-b)
  10.100.3.128/26 (AZ-c)

Public subnets are small because they only host load balancers and NAT gateways. Private subnets are large because they host the actual workloads. Data subnets are the smallest because database instances are few, though the /26 blocks still leave room for growth.
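The layout above can be computed rather than hand-maintained. A sketch using cidrsubnet, assuming the /20 shown (local names are illustrative, not our actual module):

locals {
  vpc_cidr = "10.100.0.0/20"

  # /24 public subnets (20 + 4 new bits), indices 0-2
  public_cidrs = [for i in range(3) : cidrsubnet(local.vpc_cidr, 4, i)]
  # => ["10.100.0.0/24", "10.100.1.0/24", "10.100.2.0/24"]

  # /22 private subnets (20 + 2 new bits), skipping index 0,
  # which overlaps the space used by the public and data subnets
  private_cidrs = [for i in range(3) : cidrsubnet(local.vpc_cidr, 2, i + 1)]
  # => ["10.100.4.0/22", "10.100.8.0/22", "10.100.12.0/22"]

  # /26 data subnets carved from the fourth /24 (10.100.3.0/24)
  data_cidrs = [for i in range(3) : cidrsubnet(cidrsubnet(local.vpc_cidr, 4, 3), 2, i)]
  # => ["10.100.3.0/26", "10.100.3.64/26", "10.100.3.128/26"]
}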

NAT Gateway strategy. Production VPCs get one NAT Gateway per AZ for high availability. Non-production VPCs share a single NAT Gateway to reduce cost. This is configured in our standard Terraform VPC module and controlled by a single variable.
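The single-variable switch might look like this inside a VPC module. A sketch, where var.single_nat_gateway and the subnet and route table lists are hypothetical module inputs:

resource "aws_eip" "nat" {
  count  = var.single_nat_gateway ? 1 : length(var.public_subnet_ids)
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  count         = var.single_nat_gateway ? 1 : length(var.public_subnet_ids)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = var.public_subnet_ids[count.index]
}

# Each private route table points at either the single shared gateway
# or its AZ-local gateway, depending on the flag
resource "aws_route" "nat" {
  count                  = length(var.private_route_table_ids)
  route_table_id         = var.private_route_table_ids[count.index]
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this[var.single_nat_gateway ? 0 : count.index].id
}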

Transit Gateway: The Hub

Before Transit Gateway, connecting VPCs meant VPC peering. Peering is a point-to-point connection: VPC A peers with VPC B, VPC B peers with VPC C, but A cannot reach C through B (peering is non-transitive). At our scale, managing pairwise peering connections was untenable. With N VPCs, you need up to N*(N-1)/2 peering connections. For 100 VPCs, that is nearly 5,000 connections.

Transit Gateway solves this with a hub-and-spoke model. Every VPC attaches to the Transit Gateway, and the Transit Gateway handles routing between them.

         VPC-A    VPC-B    VPC-C
           \       |       /
            \      |      /
         Transit Gateway
            /      |      \
           /       |       \
         VPC-D   VPC-E   Shared Services

Each VPC attachment adds a route in the Transit Gateway route table, and the Transit Gateway forwards traffic between VPCs based on those routes.
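Attaching a VPC takes only a small amount of Terraform. A sketch, assuming the Transit Gateway ID has been shared into the account (variable names are illustrative):

resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = var.transit_gateway_id
  vpc_id             = var.vpc_id
  subnet_ids         = var.private_subnet_ids # one per AZ; TGW places an ENI in each
}

# Send traffic destined for other VPCs to the Transit Gateway
resource "aws_route" "to_tgw" {
  route_table_id         = var.private_route_table_id
  destination_cidr_block = "10.0.0.0/8"
  transit_gateway_id     = var.transit_gateway_id
}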

Route Table Segmentation

Not every VPC should be able to reach every other VPC. We use Transit Gateway route tables to implement network segmentation:

  • Production route table: Production VPCs can reach shared services and other production VPCs but cannot reach development or staging.
  • Non-production route table: Development and staging VPCs can reach shared services and each other but cannot reach production.
  • Shared services route table: Shared services VPCs are accessible from both production and non-production.

This segmentation is enforced at the network layer. Even if an application in a development account has credentials for a production database, the network path does not exist. Defense in depth.
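In Terraform terms, segmentation comes down to which route table an attachment is associated with and which attachments propagate routes into it. A sketch for the production table, assuming the attachment IDs are hypothetical inputs:

resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = var.transit_gateway_id
}

# Production VPC attachments use the production route table
resource "aws_ec2_transit_gateway_route_table_association" "prod_vpc" {
  transit_gateway_attachment_id  = var.prod_vpc_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Only shared services and other production attachments propagate
# routes into the production table; dev/staging never appear here
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_into_prod" {
  transit_gateway_attachment_id  = var.shared_services_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

Because development attachments are never propagated into the production table, there is simply no route for that traffic to follow.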

Direct Connect

Our on-premises data centers connect to AWS through AWS Direct Connect, providing dedicated network connectivity with consistent latency and throughput. We operate two Direct Connect connections from two different facilities for redundancy.

Direct Connect traffic flows through the Transit Gateway, which means on-premises systems can reach any VPC attached to the Transit Gateway (subject to route table rules). This is significantly simpler than the previous architecture, where Direct Connect virtual interfaces had to be mapped to individual VPCs.

Data Center --- Direct Connect --- Transit Gateway --- VPCs
                   (x2 for HA)

The Direct Connect bandwidth is 10 Gbps per connection. For bulk data transfers (media files, database migrations), this dedicated bandwidth is essential. Pushing terabytes of data over the public internet is slow, expensive, and unreliable.
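Wiring Direct Connect into the Transit Gateway goes through a Direct Connect gateway. A rough sketch, with an illustrative name and ASN (the actual association also depends on the transit virtual interface, omitted here):

resource "aws_dx_gateway" "this" {
  name            = "corp-dxgw" # hypothetical name
  amazon_side_asn = "64512"     # private ASN for the Amazon side of the BGP session
}

# Associate the DX gateway with the Transit Gateway and advertise
# the AWS-side prefixes back to on-premises
resource "aws_dx_gateway_association" "tgw" {
  dx_gateway_id         = aws_dx_gateway.this.id
  associated_gateway_id = var.transit_gateway_id
  allowed_prefixes      = ["10.0.0.0/8"]
}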

Security Groups: Layered Defense

Security groups are our primary network access control mechanism, and managing them at scale requires discipline.

Standard security group patterns. Every application gets a set of security groups that follow a standard naming convention and rule structure:

# ALB security group: allows inbound HTTPS from Akamai IPs
resource "aws_security_group" "alb" {
  name_prefix = "${var.app_name}-alb-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.akamai_ip_ranges
  }
}

# Application security group: allows inbound from ALB only
resource "aws_security_group" "app" {
  name_prefix = "${var.app_name}-app-"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}

# Database security group: allows inbound from app only
resource "aws_security_group" "db" {
  name_prefix = "${var.app_name}-db-"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }
}

Each layer can only be reached from the layer above it. The database cannot be accessed directly from the internet, even if every other security control fails. The ALB only accepts traffic from known CDN IP ranges. The application only accepts traffic from the ALB.

Security group referencing over CIDR blocks. Wherever possible, we reference security groups by ID rather than CIDR blocks. This makes rules dynamic: if the application scales and new instances launch with the app security group, they automatically gain access to the database. No rule updates needed.

No 0.0.0.0/0 ingress. This is enforced by Sentinel policy in our Terraform pipeline. No security group in our environment allows unrestricted inbound access from the internet. Exceptions require a security review and are implemented with explicit CIDR ranges (CDN IPs, VPN IPs, partner IPs).

DNS Architecture

We use Route 53 private hosted zones for internal DNS. Each VPC has access to a private hosted zone that resolves internal service names:

app-api.internal.corp      resolves to  10.100.5.23
shared-cache.internal.corp resolves to  10.200.1.45
auth-service.internal.corp resolves to  10.150.3.12

The private hosted zones are associated with multiple VPCs, so services in different VPCs can resolve each other's names. This is simpler than managing IP addresses in configuration files and adapts automatically as services scale or move.
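A private hosted zone and a cross-VPC association are straightforward in Terraform. A sketch, using the internal.corp name from above and hypothetical VPC variables:

resource "aws_route53_zone" "internal" {
  name = "internal.corp"

  # The vpc block makes this a private hosted zone
  vpc {
    vpc_id = var.primary_vpc_id
  }
}

# Additional VPCs gain resolution of internal.corp via associations
resource "aws_route53_zone_association" "extra" {
  zone_id = aws_route53_zone.internal.zone_id
  vpc_id  = var.other_vpc_id
}

Cross-account associations need an extra authorization step, but the end state is the same: one zone, resolvable from many VPCs.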

Route 53 Resolver handles DNS forwarding between AWS and on-premises. On-premises systems can resolve AWS private DNS names, and AWS workloads can resolve on-premises DNS names. This bidirectional resolution is essential for hybrid environments where applications span both worlds.
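The AWS-to-on-premises direction is an outbound Resolver endpoint plus a forwarding rule. A sketch, with hypothetical domain, target IP, and variable names:

resource "aws_route53_resolver_endpoint" "outbound" {
  name               = "to-onprem"
  direction          = "OUTBOUND"
  security_group_ids = [var.resolver_sg_id]

  # Endpoints need IPs in at least two subnets for availability
  ip_address {
    subnet_id = var.subnet_a_id
  }
  ip_address {
    subnet_id = var.subnet_b_id
  }
}

# Forward queries for the on-premises domain to on-premises DNS servers
resource "aws_route53_resolver_rule" "onprem" {
  domain_name          = "corp.example.com" # hypothetical on-prem domain
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip = "10.0.0.2" # hypothetical on-prem DNS server
  }
}

resource "aws_route53_resolver_rule_association" "onprem" {
  resolver_rule_id = aws_route53_resolver_rule.onprem.id
  vpc_id           = var.vpc_id
}

The reverse direction (on-premises resolving AWS names) uses an inbound endpoint that on-premises DNS servers forward to.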

Network Monitoring

Visibility into network behavior is non-negotiable at this scale:

  • VPC Flow Logs: Every VPC ships flow logs to a centralized S3 bucket in the logging account. Flow logs capture source IP, destination IP, port, protocol, and action (accept/reject) for every network connection. We use Athena to query flow logs for security investigations and capacity planning.
  • Transit Gateway metrics: CloudWatch metrics on Transit Gateway show bytes in/out, packets in/out, and packet drop counts per attachment. Spikes in packet drops indicate routing issues or capacity constraints.
  • Direct Connect monitoring: CloudWatch metrics on Direct Connect connections track utilization, errors, and availability. We alert when utilization exceeds 70% to trigger capacity planning conversations.
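Flow log shipping is one resource per VPC. A sketch, assuming the centralized bucket ARN is a hypothetical input from the logging account:

resource "aws_flow_log" "vpc" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL" # capture both accepted and rejected traffic
  log_destination_type = "s3"
  log_destination      = var.central_flow_log_bucket_arn # bucket in the logging account
}

Writing directly to S3 avoids per-account CloudWatch Logs costs and puts everything in one place for Athena to query.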

Lessons Learned

CIDR planning is forever. The CIDR blocks you choose today will constrain your networking options for years. Allocate generously. Running out of IP space in a VPC means either complex workarounds (secondary CIDRs, NAT) or rebuilding the VPC from scratch. Neither is pleasant.

Transit Gateway simplified our lives dramatically. The transition from pairwise peering to Transit Gateway reduced our networking complexity by an order of magnitude. If you are managing more than a dozen VPCs, Transit Gateway is not optional.

Security groups are your most important security control. IAM policies control who can manage resources. Security groups control which resources can communicate. Both are essential, but security groups prevent entire classes of lateral movement attacks that IAM alone cannot address.

Document your network topology. Complex networking configurations that exist only in Terraform code and the minds of the networking team are a bus-factor risk. We maintain architecture diagrams that show VPC topologies, Transit Gateway attachments, and Direct Connect paths. They are updated with every significant change.

Networking is the foundation. Get it right early, document it thoroughly, and invest in monitoring. Everything else you build on AWS depends on it.
