The thesis: why active/passive
Active/active sounds great in a slide deck. In an actual production system it confronts you with four real costs.
Multi-master DB writes. You're choosing between eventual consistency, which most app business logic isn't built for, and cross-region write quorum latency, typically 50–100ms RTT cross-coast added to every write. Aurora Global Database is asynchronous by design and has a single writer region: secondary clusters can't accept writes locally (write forwarding only relays them to the primary). Most SaaS has a single source of write truth, and pretending otherwise breaks things downstream.
Conflict resolution. Every write path needs idempotency keys, last-write-wins semantics, or CRDTs. That's an app-level rewrite across every endpoint that mutates state — for a failure mode that triggers maybe once a year.
Cache coherency. Invalidation messages need to cross regions. Either you accept stale reads (and the bugs that come with them) or you pay for global cache replication.
Cost. You're running full production capacity in every region, 24/7.
Active/passive accepts a few minutes of failover time and trades it for: a single source of write truth, no app-level changes to support multi-region, and 30–50% lower steady-state cost — depending on how warm DR is kept.
Active/active is the right call when sub-second failover RTO is non-negotiable (finance, ad-tech), the app is genuinely read-heavy and writes are idempotent, you have the engineering bandwidth to maintain conflict-resolution logic across the codebase, or compliance / contractual SLAs require it.
For most B2B SaaS, "users notice 2–5 minutes during an actual regional outage, which happens maybe once every 2–3 years" is a perfectly reasonable trade.
Architecture at a glance
Two regions: a primary and a passive secondary. One gotcha worth flagging up front before you finalize your region pair — not every AWS region has three availability zones. A handful have only two. Before locking in a 3-AZ subnet topology, check the AZ count in your chosen DR region; if you're stuck at 2, your subnet plan, RDS multi-AZ configuration, and any service that requires 3+ AZs all have to adapt.
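One way to keep the AZ count from being hard-coded is to read it from the region at plan time. The following is a minimal sketch, assuming the AWS provider's `aws_availability_zones` data source; `var.desired_az_count` and the local names are illustrative, not part of the original setup.

```hcl
# Sketch: derive the AZ list per region instead of assuming three.
# var.desired_az_count and the local names are illustrative assumptions.
variable "desired_az_count" {
  type    = number
  default = 3
}

data "aws_availability_zones" "available" {
  state = "available"

  # Exclude zones that need an explicit opt-in (for example Local Zones).
  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"]
  }
}

locals {
  # Use at most desired_az_count, or fewer if the region offers fewer.
  az_count = min(var.desired_az_count, length(data.aws_availability_zones.available.names))
  azs      = slice(data.aws_availability_zones.available.names, 0, local.az_count)
}

output "azs" {
  value = local.azs
}
```

Running `terraform plan` against the candidate DR region before committing to it shows how many zones the account can actually use there.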
Primary region runs:
- VPC with 3 AZs, public + private subnets
- Internal ALB
- API Gateway HTTP + VPC Link → ALB
- ECS Fargate services at full capacity
- Aurora PostgreSQL primary cluster (the writer)
- ElastiCache Redis
- Route 53 A record pointing to the primary API Gateway
DR region runs:
- VPC with 2 AZs, public + private subnets
- The same ALB / API Gateway / ECS stack, but ECS is "warm idle" at desired_count=1, min_capacity=1
- Aurora secondary cluster (Aurora Global Database, async)
- ElastiCache Redis — independent cluster, no cross-region replication
- No Route 53 record under steady state
The single source of truth for "which region serves traffic" is the Route 53 A record. The primary owns it during normal operation. Failover transfers ownership.
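The "warm idle" sizing in the DR list above could be expressed roughly as follows. This is a sketch under assumptions: the variable names are hypothetical, the service is assumed to sit behind Application Auto Scaling, and networking and load balancer settings are elided.

```hcl
# Sketch only: variable names are illustrative, and networking,
# load balancer, and task definition details are elided.
variable "ecs_cluster_arn" { type = string }
variable "ecs_cluster_name" { type = string }
variable "task_definition_arn" { type = string }

variable "desired_count" {
  type    = number
  default = 1 # DR tfvars keep 1; primary tfvars set production levels
}

variable "min_capacity" {
  type    = number
  default = 1
}

variable "max_capacity" {
  type    = number
  default = 10 # hypothetical ceiling
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.ecs_cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"
  # network_configuration, load_balancer: elided

  lifecycle {
    # Autoscaling (or a failover script) may change the count at runtime;
    # don't let the next apply reset it.
    ignore_changes = [desired_count]
  }
}

resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.ecs_cluster_name}/${aws_ecs_service.api.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = var.min_capacity
  max_capacity       = var.max_capacity
}
```

Keeping the same module in both regions and varying only the tfvars means the DR stack is exercised by every deploy, just at a smaller size.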
Key design decisions worth explaining
Six decisions, because each one represents a path not taken.
1. Aurora Global Database, not logical replication
Storage-layer replication. Sub-second lag under nominal load. No application changes required.
The trade-off: it's an AWS managed feature, with no portability if you ever want to leave AWS. For a single-cloud shop, that simplicity wins over rolling your own logical replication.
The gotcha: encrypted cross-region replication requires an explicit KMS key in the DR region. KMS keys are regional, so the secondary can't reuse the primary's key, and Aurora can't use the AWS-managed default KMS key for cross-region replication. You will discover this the moment your first terraform apply fails with a cryptic AWS error. Provision a customer-managed KMS key in the DR region and set it on the secondary cluster.
resource "aws_rds_global_cluster" "this" {
  global_cluster_identifier = "myapp-prod-global"
  engine                    = "aurora-postgresql"
  engine_version            = "16.3"
  database_name             = "app"
  storage_encrypted         = true
}

# Primary region's stack:
resource "aws_rds_cluster" "primary" {
  global_cluster_identifier = aws_rds_global_cluster.this.id
  cluster_identifier        = "myapp-prod-primary"
  engine                    = aws_rds_global_cluster.this.engine
  engine_version            = aws_rds_global_cluster.this.engine_version
  kms_key_id                = aws_kms_key.aurora.arn # primary-region key
  # ... writer config
}

# DR region's stack:
resource "aws_rds_cluster" "secondary" {
  global_cluster_identifier = "myapp-prod-global" # joins existing
  cluster_identifier        = "myapp-prod-secondary"
  engine                    = "aurora-postgresql"
  engine_version            = "16.3"
  source_region             = var.primary_region
  kms_key_id                = aws_kms_key.aurora_dr.arn # DR-region key, required
  # ... replica config
}
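The `aws_kms_key.aurora_dr` referenced by the secondary cluster has to exist in the DR region's stack. A minimal sketch of what that might look like, assuming the stack's default provider targets the DR region; the description, alias name, and deletion window are placeholders.

```hcl
# DR region's stack. Assumes the default provider here targets the DR region.
resource "aws_kms_key" "aurora_dr" {
  description             = "Aurora secondary cluster storage encryption (DR region)"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

# Optional alias so the key is recognizable in the console (hypothetical name).
resource "aws_kms_alias" "aurora_dr" {
  name          = "alias/myapp-prod-aurora-dr"
  target_key_id = aws_kms_key.aurora_dr.key_id
}
```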
2. Redis is not replicated
ElastiCache Global Datastore exists. We don't use it.
Caches can be cold after failover. The app must tolerate that. Cache miss → DB hit → cache rewarms. ~30 seconds of elevated DB load during warmup is the price.
This is deliberate. The alternative — paying for global Redis 24/7 to avoid a minute of cache miss during a failure that happens annually — doesn't survive a CFO conversation, and it adds operational complexity (cross-region invalidation, replica failover, version skew) that we don't need.
3. Terraform state split: platform vs workloads
One Terraform state per region per layer.
- platform: VPC, ALB, API Gateway, Aurora cluster, Redis. Changes rarely.
- workloads/*: ECS task definitions, service configs. Changes every deploy.
Why this matters for DR: you can roll image updates to the workloads stacks without touching platform. Critical when you're pushing a hotfix to two regions under pressure and don't want VPC routes shifting because someone bumped a Terraform module version.
Cross-stack references go through terraform_remote_state:
data "terraform_remote_state" "base" {
  backend = "s3"
  config = {
    bucket = "myapp-tf-state"
    key    = "platform/${var.region}/terraform.tfstate"
    region = "us-east-1" # wherever your tf state bucket lives
  }
}

resource "aws_ecs_service" "api" {
  cluster         = data.terraform_remote_state.base.outputs.ecs_cluster_id
  task_definition = aws_ecs_task_definition.api.arn
  # ...
}
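For that lookup to work, the platform stack has to publish the value as an output and store its state under the matching key. A sketch of the platform side, assuming a hypothetical `aws_ecs_cluster.this` resource; the example region in the init command is a placeholder.

```hcl
# platform stack: publish what the workloads stacks need to read.
output "ecs_cluster_id" {
  value = aws_ecs_cluster.this.id # hypothetical resource name
}

# Backend blocks can't interpolate variables, so the per-region key is
# supplied at init time (partial configuration), for example:
#   terraform init -backend-config="key=platform/us-east-1/terraform.tfstate"
terraform {
  backend "s3" {
    bucket = "myapp-tf-state"
    region = "us-east-1" # wherever your tf state bucket lives
  }
}
```

Only values exposed as outputs are visible through terraform_remote_state, which keeps the contract between the two layers explicit.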
4. The IAM namespace problem
IAM is global, not regional. Two regions creating an IAM role with the same name? Collision.
Solution: a name_suffix variable that DR resources append to globally-namespaced things.
variable "name_suffix" {
  type    = string
  default = "" # set to "-dr" in DR region's tfvars
}

resource "aws_iam_role" "task" {
  name = "${var.app_name}-${var.env}${var.name_suffix}-task-role"
  # ...
}
Local-to-region resources (security groups, target groups, log groups) don't need the suffix. Only IAM and a handful of other global namespaces do. Worth being explicit about which is which in the module's documentation, because the rule isn't intuitive.
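One way to make the rule visible in the module itself is to keep two name prefixes side by side. This is a sketch, assuming `var.app_name`, `var.env`, and `var.name_suffix` are declared as in the module above; the local names, the policy contents, and the log group path are illustrative.

```hcl
locals {
  # Regional names are unique per region, so no suffix is needed.
  regional_prefix = "${var.app_name}-${var.env}"

  # Global names (IAM) must differ between primary and DR.
  global_prefix = "${var.app_name}-${var.env}${var.name_suffix}"
}

data "aws_iam_policy_document" "task" {
  statement {
    actions   = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["*"]
  }
}

# Global namespace: uses the suffixed prefix.
resource "aws_iam_policy" "task" {
  name   = "${local.global_prefix}-task-policy"
  policy = data.aws_iam_policy_document.task.json
}

# Regional namespace: the same name can exist in both regions.
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.regional_prefix}/api"
  retention_in_days = 30
}
```

S3 bucket names are another global namespace and would take the suffixed prefix as well.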
5. VPC CIDR non-overlap
Non-overlapping CIDR ranges per region — for example, 10.0.0.0/16 for the primary and 10.1.0.0/16 for DR.
We don't peer them today. We designed for it.
If you ever want to set up Transit Gateway, VPC peering, a Direct Connect that reaches both, or a Site-to-Site VPN that routes to either, overlapping CIDRs get in the way of every one of those: VPC peering refuses overlapping ranges outright, and the others need NAT workarounds to route at all. Picking non-overlapping ranges up front costs nothing. Picking overlapping ranges and discovering it three years later costs a migration.
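With one /16 per region, subnets can be carved out of the region's range with Terraform's built-in `cidrsubnet` function, so nothing region-specific is hard-coded. A sketch, where the variable and local names and the /20 sizing are assumptions for illustration:

```hcl
variable "vpc_cidr" {
  type = string # e.g. "10.0.0.0/16" in primary tfvars, "10.1.0.0/16" in DR
}

variable "az_count" {
  type    = number
  default = 3 # 2 in a two-AZ DR region
}

locals {
  # /16 plus 4 bits gives /20 subnets. Indexes 0..7 private, 8..15 public.
  private_subnet_cidrs = [for i in range(var.az_count) : cidrsubnet(var.vpc_cidr, 4, i)]
  public_subnet_cidrs  = [for i in range(var.az_count) : cidrsubnet(var.vpc_cidr, 4, i + 8)]
}

# With vpc_cidr = "10.1.0.0/16" and az_count = 2:
#   private_subnet_cidrs = ["10.1.0.0/20", "10.1.16.0/20"]
#   public_subnet_cidrs  = ["10.1.128.0/20", "10.1.144.0/20"]
```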
6. DNS ownership is binary, not weighted
We considered Route 53 weighted records and health-check-driven automatic failover. We rejected both.
Three reasons:
- Health-check false positives cause flapping. A bad deploy or a noisy ALB metric triggers a switch when nothing is actually wrong regionally.
- Weighted records during partial outages send traffic to a degraded region. If the primary is 70% broken (slow responses, partial 5xx, but ALB health checks still passing), weighted DNS keeps routing its full configured share of users into it.
- A manual gate gives you a deliberate human-in-the-loop checkpoint. Someone with their hands on the keyboard verifies "yes, this is really a regional failure, not our latest deploy."
The trade-off: RTO goes from "automatic, ~30 seconds" to "manual, ~5 minutes." For our risk profile that's the right call. For yours, maybe not — but the choice should be deliberate, not default.
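In Terraform terms, binary ownership can be modeled as a record that exists in exactly one region's stack at a time. The sketch below assumes an alias record onto an API Gateway custom domain (`aws_apigatewayv2_domain_name.api`, defined elsewhere in the stack); `var.owns_dns_record`, `var.hosted_zone_id`, and `var.api_domain_name` are hypothetical names.

```hcl
variable "owns_dns_record" {
  type    = bool
  default = false # true only in the region currently serving traffic
}

variable "hosted_zone_id" { type = string }
variable "api_domain_name" { type = string }

resource "aws_route53_record" "api" {
  # No record at all in the passive region: no weights, no health checks.
  count   = var.owns_dns_record ? 1 : 0
  zone_id = var.hosted_zone_id
  name    = var.api_domain_name
  type    = "A"

  alias {
    name                   = aws_apigatewayv2_domain_name.api.domain_name_configuration[0].target_domain_name
    zone_id                = aws_apigatewayv2_domain_name.api.domain_name_configuration[0].hosted_zone_id
    evaluate_target_health = false
  }
}
```

The flag flips only as a deliberate change to tfvars, which mirrors the manual gate described above.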
The failover mechanics
The failover script — call it promote-dr.sh — is AWS CLI only. No Terraform.
Why no Terraform during failover:
- Terraform state typically lives in S3 in the primary region. If the primary is down, you can't read state. If you can't read state, you can't apply.
- AWS CLI commands hit the AWS control plane directly per region. They work even with the primary region fully dark.
The script's steps:
- aws rds failover-global-cluster: promotes the DR Aurora secondary to the global cluster's primary writer (an unplanned failover needs the --allow-data-loss flag). This is the irreversible step.
- Scale up ECS: bump desired_count and min_capacity on DR services from 1 to production levels.
- Switch ALB listener default routing if blue/green is in play.
- Route 53 upsert: the DR account/role now owns the A record, pointing it at the DR API Gateway.
- Verify: health-check the new public endpoint, watch error rates, run synthetics.
What the script intentionally doesn't do:
It does not update Terraform state. Drift is accepted during a failover. State reconciliation happens after the dust settles. Trying to keep Terraform aligned with reality during an active outage is how you accidentally terraform destroy something you shouldn't.
Honest RTO timing:
- Aurora failover: 60–120 seconds for global cluster failover-to-promote
- ECS scale-up: 60–180 seconds, depending on image pull + warmup
- DNS propagation: 60-second TTL → most clients see the new IP within 1–2 minutes
Total realistic RTO: 5–10 minutes for a clean failover. Longer if humans are reading runbooks for the first time. Substantially longer if anything unexpected breaks — which is why drills exist.
The hard part: failback
This is what most multi-region writeups gloss over. Failover is the rehearsed-once-a-year event. Failback is the part that gets skipped in DR drills and then surprises you in production.
The state problem after failover:
- The old primary's Aurora cluster is now detached from the global cluster. It's sitting alone with stale data: whatever was in it just before failover, and no writes since.
- The global cluster now has DR as its sole member, acting as the primary writer.
- There is nowhere to fail back to until you rebuild the old primary.
The failback sequence:
- Snapshot the stale primary cluster. Safety net. You'll delete it next; keep the snapshot around.
- Delete the primary cluster and its instances. This is the scary step. The cluster you're deleting has data that diverges from the current global truth. You're sure about that snapshot, right?
- Reconfigure the primary's Terraform. Set is_aurora_secondary = true and aurora_source_region = var.dr_region. Apply. This recreates it as a new secondary of the global cluster, replicating from DR.
- Wait for replication to sync. Minutes to hours, depending on database size and write throughput.
- Planned switchover, not forced failover. aws rds switchover-global-cluster makes the original primary the writer again. The difference from failover-global-cluster: switchover is graceful and reversible; failover is the emergency button.
- Restore Terraform vars to normal. Primary = writer, DR = secondary, primary owns the DNS record.
- Apply all stacks in both regions to reconcile drift that accumulated during the incident.
The reason to flag this: failover is the rehearsed event. Failback is the part that turns a recoverable outage into a multi-day operational headache when nobody has done it before. Test it. At least once, end-to-end, in a non-prod environment.
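The is_aurora_secondary and aurora_source_region variables from the failback sequence imply a cluster module that can play either role. The original module isn't shown, so the following is only a guess at its shape: the two variable names come from the text, everything else (the other variables, the resource name) is assumed and would be declared alongside.

```hcl
variable "is_aurora_secondary" {
  type    = bool
  default = false # true while this region replicates from the other one
}

variable "aurora_source_region" {
  type    = string
  default = null # region of the current writer, set only on a secondary
}

# Assumed to be declared elsewhere in the module: global_cluster_identifier,
# cluster_identifier, engine_version, kms_key_arn, db_username, db_password.
resource "aws_rds_cluster" "this" {
  global_cluster_identifier = var.global_cluster_identifier
  cluster_identifier        = var.cluster_identifier
  engine                    = "aurora-postgresql"
  engine_version            = var.engine_version
  storage_encrypted         = true
  kms_key_id                = var.kms_key_arn

  # Only a secondary names a source region; a writer leaves it unset.
  source_region = var.is_aurora_secondary ? var.aurora_source_region : null

  # A secondary inherits credentials from the global cluster, so these
  # are set only when the cluster is created as the writer.
  master_username = var.is_aurora_secondary ? null : var.db_username
  master_password = var.is_aurora_secondary ? null : var.db_password
}
```

With this shape, the Terraform side of failback comes down to flipping tfvars in each region and applying in the right order, which is exactly the part worth rehearsing in non-prod.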
The numbers
Honest, illustrative — your mileage will vary by application:
| Metric | Value |
|---|---|
| RPO (data loss window) | < 5 seconds (Aurora Global async lag) |
| RTO (time to recovery) | 5–10 min clean; up to 30 min messy |
| Steady-state cost premium | ~35–50% over single-region |
| Engineering time per drill | ~2–3 hours quarterly |
The cost premium comes from: the Aurora secondary cluster (full-size, no discount for being passive), warm ECS capacity at minimum, and cross-region data transfer for Aurora replication. The "warm idle" ECS overhead is small in absolute terms; Aurora dominates.
When this pattern isn't enough
The limits, so this doesn't read like a sales pitch:
- Sub-minute RTO requirements. This pattern won't get there. Go active/active or accept downtime.
- Zero data loss (RPO = 0). Aurora Global is asynchronous. If your SLA requires synchronous replication, you're paying for cross-region write latency on every transaction. There's no free lunch.
- Compliance with "no single region of risk." Some regulations effectively require active/active across regions. This pattern alone won't satisfy them.
- Region-specific data residency. If EU data must stay in EU, multi-region within a continent is fine — but a global active/passive setup is not a substitute for a proper data residency architecture.
The closing thought
Multi-region production isn't about preventing the 1-in-1000 region outage from causing downtime. It's about preventing it from causing unbounded downtime.
Active/passive turns "the region is gone, we have no idea when we're back" into "the region is gone, we'll be back in 10 minutes."
That bounded recovery is the actual product.