Active-Active Multi-Region Architecture with Route 53
Designing active-active multi-region deployments with Route 53 failover, health checks, and data replication strategies
When a service goes down at a company that serves hundreds of millions of users, the blast radius is enormous: not just revenue, but brand trust, customer experience, and operational credibility. We have been designing and implementing active-active multi-region architectures on AWS, and the work has fundamentally changed how we think about availability.
The goal is straightforward to state and difficult to achieve: the application serves traffic from multiple AWS regions simultaneously, and if an entire region becomes unavailable, traffic shifts to the remaining regions with minimal disruption.
Why Active-Active, Not Active-Passive
The traditional disaster recovery model is active-passive. One region handles all traffic; the other sits idle, waiting for a failover event. This approach has two problems at enterprise scale.
First, the passive region is untested. You do not know if it actually works until you need it, and that is the worst possible time to find out. Failover drills help, but they can never fully replicate the conditions of a real regional failure. The only way to be confident that your secondary region works is to run production traffic through it continuously.
Second, active-passive wastes money. You are paying for infrastructure that sits idle most of the time. In an active-active model, both regions serve traffic, so you get full utilization of your investment. During a regional failure, the surviving region handles the increased load, which requires headroom planning but not an entirely idle duplicate environment.
Route 53 Routing Policies
Route 53 is the foundation of our multi-region traffic management. We use several routing policies depending on the service requirements.
Latency-based routing sends users to the region that provides the lowest latency. For most of our web applications, this means East Coast users hit us-east-1 and West Coast users hit us-west-2. International traffic routes to whichever region is closer.
                api.company.com
                       |
           Route 53 (Latency-Based)
            /                     \
      us-east-1               us-west-2
         ALB                      ALB
          |                        |
     ECS Cluster              ECS Cluster
Weighted routing gives us fine-grained control over traffic distribution. During a deployment, we can shift 10% of traffic to the region running the new version, validate metrics, and gradually increase the weight. This is canary deployment at the DNS level.
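Canary weight shifts can be scripted against the Route 53 API. A minimal sketch of building the ChangeBatch for change_resource_record_sets; the record name, ALB DNS name, and identifiers are hypothetical, and a plain weighted CNAME is used here for simplicity (production setups pointing at an ALB often use alias records instead):

```python
def weighted_change_batch(record_name, alb_dns, set_identifier, weight):
    """Build a Route 53 ChangeBatch that UPSERTs one weighted CNAME record.
    Raising `weight` on the canary region shifts a larger share of DNS
    responses, and therefore traffic, toward it."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": 60,                        # short TTL so shifts take effect quickly
                "SetIdentifier": set_identifier,  # one record set per region
                "Weight": weight,                 # 0-255; traffic share is weight / sum of weights
                "ResourceRecords": [{"Value": alb_dns}],
            },
        }]
    }

# Shift roughly 10% of traffic to us-west-2 (assuming the us-east-1
# record set carries weight 90):
canary = weighted_change_batch(
    "api.company.com", "canary-alb.us-west-2.elb.amazonaws.com", "us-west-2", 10
)
```

The batch is applied with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=canary)` (hosted zone ID elided).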
Failover routing is our safety net. Each endpoint has a health check, and Route 53 automatically removes unhealthy endpoints from DNS responses. The health checks are not simple TCP pings; they hit a deep health endpoint that validates database connectivity, cache availability, and downstream service health.
Health Check Design
A superficial health check that returns 200 as long as the web server is running is worse than no health check at all. It gives you false confidence. Our health endpoints validate:
- Application process is running and accepting requests.
- Database connection pool is healthy and queries execute within acceptable latency.
- Cache (ElastiCache/Redis) is reachable.
- Critical downstream services are responding.
- Disk space and memory are within acceptable thresholds.
{
  "status": "healthy",
  "checks": {
    "database": {"status": "healthy", "latency_ms": 12},
    "cache": {"status": "healthy", "latency_ms": 2},
    "downstream_api": {"status": "healthy", "latency_ms": 45}
  }
}
If any critical check fails, the endpoint returns a 503, and Route 53 stops routing traffic to that region. The threshold is configurable: we typically require 3 consecutive failures before Route 53 removes an endpoint, preventing flapping from transient issues.
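The aggregation behind such an endpoint is simple: run each check, record its latency, and let any critical failure flip the whole response to 503. A framework-agnostic sketch; the check names and callables are illustrative:

```python
import json
import time

def run_checks(checks):
    """checks: mapping of name -> zero-arg callable that raises on failure.
    Returns (http_status, body_json): 200 only when every critical
    dependency passes, matching the response shape shown above."""
    results, healthy = {}, True
    for name, probe in checks.items():
        start = time.monotonic()
        try:
            probe()
            status = "healthy"
        except Exception:
            status, healthy = "unhealthy", False
        results[name] = {
            "status": status,
            "latency_ms": round((time.monotonic() - start) * 1000),
        }
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), json.dumps(body)
```

A web handler then just returns whatever `run_checks` produces, with probes like a `SELECT 1` against the connection pool or a Redis `PING` wired in per dependency.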
Data Replication: The Hard Part
Routing traffic to multiple regions is the easy part. Keeping data consistent across regions is where the real complexity lives.
DynamoDB Global Tables are our preferred solution for data that needs to be read and written in multiple regions. Global Tables provide active-active replication with last-writer-wins conflict resolution. For most of our use cases, the eventual consistency model (typically sub-second replication lag) is acceptable.
For data with stricter consistency requirements, we designate a primary region for writes and use read replicas in secondary regions. The application routes write operations to the primary region regardless of where the read traffic lands. This is a compromise: it means write operations from the secondary region incur cross-region latency, but it avoids the complexity of multi-master conflict resolution.
RDS Cross-Region Read Replicas handle our relational database needs. The primary RDS instance lives in us-east-1, with a read replica in us-west-2. In normal operation, reads are served locally in each region, and writes go to the primary. During a failover, we promote the read replica to a standalone instance.
The promotion process is not instant. It takes several minutes, during which write operations are unavailable. This is an acceptable trade-off for our workloads, but it means truly zero-downtime failover for write-heavy workloads requires DynamoDB or a similar multi-master data store.
S3 Cross-Region Replication handles our static assets, media files, and other object storage. Replication is asynchronous, with typical lag under 15 minutes. For user-facing content, we front S3 with CloudFront, which provides its own caching layer and reduces the impact of replication lag.
Session Management
Stateful sessions break multi-region architectures. If a user's session is stored on an ECS task in us-east-1, and their next request routes to us-west-2, the session is lost. We solved this by externalizing all session state to ElastiCache for Redis with cross-region replication (Global Datastore).
Every request is self-contained from the application server's perspective. The server reads session data from Redis, processes the request, and writes any session updates back. The Redis cluster replicates changes to the secondary region. If traffic shifts regions, the session data is already there.
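In code, externalized sessions reduce to a read-modify-write against Redis. A sketch using the redis-py style `get`/`setex` interface; the key prefix and TTL are illustrative conventions, not requirements:

```python
import json

SESSION_TTL_SECONDS = 1800  # illustrative: 30-minute sliding expiry

def load_session(redis_client, session_id):
    """Fetch session state; a miss (new user or expired key) is an empty session."""
    raw = redis_client.get(f"session:{session_id}")
    return json.loads(raw) if raw else {}

def save_session(redis_client, session_id, state):
    """Write state back with a TTL so abandoned sessions eventually expire.
    Cross-region replication makes this state visible to both regions,
    so a mid-session traffic shift is invisible to the user."""
    redis_client.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(state))
```

The application server holds no session state between requests; everything it needs is loaded at the start of the request and persisted at the end.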
Deployment Coordination
Deploying to multiple regions simultaneously is risky. A bad deployment could take down both regions at once, defeating the purpose of multi-region architecture. Our deployment strategy is sequential with validation gates:
1. Deploy to us-west-2 (lower traffic).
2. Run automated validation: synthetic transactions, error rate monitoring, latency checks.
3. Wait 30 minutes. Monitor production metrics.
4. Deploy to us-east-1.
5. Run the same validation suite.
If validation fails in step 2, we roll back in us-west-2 and the deployment never reaches us-east-1. This ensures that at least one region is always running a known-good version.
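The validation gate itself can be a pure decision over the metrics collected during that window. A sketch; the threshold values are illustrative, not our production numbers:

```python
MAX_ERROR_RATE = 0.01      # illustrative: 1% of requests
MAX_P99_LATENCY_MS = 750   # illustrative p99 budget

def gate(metrics):
    """Decide whether the canary region is safe to keep.
    metrics: {"error_rate": float, "p99_latency_ms": float,
              "synthetic_pass": bool}. Any breach means rolling back
    us-west-2 before the deployment ever reaches us-east-1."""
    if not metrics["synthetic_pass"]:
        return "rollback"
    if metrics["error_rate"] > MAX_ERROR_RATE:
        return "rollback"
    if metrics["p99_latency_ms"] > MAX_P99_LATENCY_MS:
        return "rollback"
    return "proceed"
```

In practice the inputs come from CloudWatch and the synthetic test suite; keeping the decision itself as a pure function makes the gate easy to test and audit.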
Failover Testing
We run quarterly failover drills where we deliberately remove a region from service and validate that the remaining region absorbs the traffic. The drill process:
- Notify stakeholders (this is a planned event, not an incident).
- Set Route 53 health checks for the target region to unhealthy.
- Monitor traffic shift, error rates, and latency in the surviving region.
- Run functional tests against the surviving region.
- Restore the failed region and verify traffic rebalances.
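Forcing the target region out of DNS does not require breaking anything: Route 53 health checks support an `Inverted` flag, which makes a healthy endpoint report as failed. A sketch that builds the parameters for `update_health_check` (the health check ID is hypothetical):

```python
def drill_toggle(health_check_id, failing):
    """Kwargs for route53.update_health_check. Inverted=True makes Route 53
    treat the (actually healthy) endpoint as unhealthy, shifting traffic to
    the surviving region without touching the application stack."""
    return {"HealthCheckId": health_check_id, "Inverted": failing}

# Start the drill against us-east-1, then undo it afterwards:
start = drill_toggle("hc-east-1-deep", True)   # hypothetical check ID
stop = drill_toggle("hc-east-1-deep", False)
```

Each dict is passed as `boto3.client("route53").update_health_check(**start)`; flipping the flag back restores normal routing for the rebalance check at the end of the drill.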
Every drill reveals something: a service that was not properly configured for multi-region, a capacity gap in the surviving region, a monitoring blind spot. The drills are worth every minute of the operational overhead they require.
Cost Considerations
Active-active multi-region roughly doubles your infrastructure cost. Two sets of compute, two sets of databases, cross-region data transfer charges. The business justification has to come from the cost of downtime, which at our scale is measured in millions of dollars per hour for customer-facing services.
Not every service warrants multi-region deployment. Internal tools, batch processing pipelines, and services with relaxed availability requirements run in a single region with standard backup and recovery procedures. Multi-region is reserved for services where regional failure is an unacceptable business risk.
The Honest Assessment
Active-active multi-region is not a checkbox you complete. It is an ongoing operational discipline. Every new feature, every new dependency, every new data store needs to be evaluated through the lens of multi-region compatibility. It adds complexity to development, testing, deployment, and operations.
But when a region has issues and your users do not notice, the investment proves itself. That has happened to us twice in the past six months, and both times, Route 53 failover worked exactly as designed. Traffic shifted, services continued, and we addressed the regional issue without an incident.
That is the peace of mind you are paying for.