AWS Disaster Recovery & Backup Strategy: RTO/RPO, Cross-Region & Restore Testing
Disaster recovery planning is the engineering work nobody wants to do — until the day the primary region goes down and the team scrambles for a runbook that nobody tested. This guide provides a complete, opinionated DR framework for AWS: from setting RTO/RPO targets and choosing the right strategy to automating backups, cross-region replication, and running game days.
TL;DR
Match your DR strategy to your RTO/RPO budget: Backup & Restore for RPO hours/RTO days (cheapest), Pilot Light for RPO minutes/RTO hours (moderate), Warm Standby for RPO seconds/RTO minutes (higher cost), Multi-Site Active-Active for RPO~0/RTO~0 (highest cost). Automate backups with AWS Backup, test restores quarterly, and run game days twice yearly — untested DR plans fail when you need them most.
Table of Contents
- Why Disaster Recovery Planning Before Production is Non-Negotiable
- RTO & RPO: Setting Realistic Targets
- DR Strategy Spectrum: Backup to Multi-Site
- AWS Backup: Centralized Backup Automation
- Pilot Light DR Architecture
- Warm Standby: Always-On Reduced Capacity
- Multi-Site Active-Active: Zero Downtime
- Data Replication: S3, RDS, DynamoDB, EBS
- DR Testing: Game Days & Automated Runbooks
- DR Cost Optimization & Checklist
1. Why Disaster Recovery Planning Before Production is Non-Negotiable
An oft-cited industry statistic holds that 40% of businesses never reopen after a major data loss event, and for financial services, average downtime costs run $9,000–$17,000 per minute. Yet DR planning is routinely deferred until "after launch" — which means it never happens before the first crisis.
The AWS Shared Responsibility Model for DR
AWS provides infrastructure availability (AZ redundancy, hardware fault tolerance, global network). You are responsible for your application's disaster recovery: data backup, cross-region replication, failover automation, and restore testing. AWS going down doesn't excuse your SLA violation.
- AZ failure: Rare but documented (us-east-1 2012, 2019). Multi-AZ deployment handles this — but it's not DR.
- Regional failure: Extremely rare but not impossible (the us-east-1 cascading Kinesis failure in November 2020 affected multiple services simultaneously). Only cross-region DR handles this.
- Data corruption: Accidental DELETE, bug that overwrites data. Backups with point-in-time recovery are your only defense.
- Ransomware / accidental deletion: Same-account backups are vulnerable. Cross-account backup to a dedicated backup account is required for resilience.
DR vs High Availability
High Availability (HA) = redundancy within a region (Multi-AZ RDS, ALB across AZs, Auto Scaling). HA handles hardware failures and AZ outages automatically without human intervention.
Disaster Recovery (DR) = the plan for when an entire region fails, data is corrupted, or a catastrophic event makes the primary environment unusable. DR requires explicit design, automation, and regular testing.
2. RTO & RPO: Setting Realistic Targets
RTO (Recovery Time Objective) is the maximum acceptable downtime after a disaster. RPO (Recovery Point Objective) is the maximum acceptable data loss, measured in time. Both are business decisions, not engineering defaults — get stakeholder sign-off with explicit cost implications.
| Business Tier | RTO | RPO | DR Strategy | Approx. Monthly Cost |
|---|---|---|---|---|
| Tier 1 Critical (payments, auth) | < 1 min | Near 0 | Multi-Site Active-Active | $10,000+ |
| Tier 2 Important (API, core services) | < 15 min | < 1 min | Warm Standby | $2,000–5,000 |
| Tier 3 Standard (reporting, analytics) | < 1 hour | < 15 min | Pilot Light | $500–2,000 |
| Tier 4 Low Priority (batch, dev tools) | < 24 hours | < 4 hours | Backup & Restore | $50–200 |
A useful rule of thumb: each order-of-magnitude improvement in RTO (hours → minutes → seconds) roughly multiplies the cost of DR infrastructure by 3–10×. Map your services to tiers, get business sign-off on the cost implications, and don't over-engineer DR for Tier 4 systems.
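The tier table above can be encoded as a simple lookup — a minimal Python sketch (thresholds taken from the table; the function name and second-granularity are illustrative) that picks the cheapest strategy satisfying a given RTO/RPO requirement:

```python
# Map an RTO/RPO requirement (in seconds) to the cheapest DR strategy
# that can deliver it, per the tier table above.
def choose_dr_strategy(rto_seconds: int, rpo_seconds: int) -> str:
    # (strategy, RTO it delivers, RPO it delivers), cheapest first
    strategies = [
        ("Backup & Restore", 24 * 3600, 4 * 3600),       # Tier 4
        ("Pilot Light", 3600, 15 * 60),                  # Tier 3
        ("Warm Standby", 15 * 60, 60),                   # Tier 2
        ("Multi-Site Active-Active", 60, 1),             # Tier 1
    ]
    # Pick the first (cheapest) strategy whose delivered RTO/RPO
    # are at least as tight as the required targets.
    for name, delivered_rto, delivered_rpo in strategies:
        if delivered_rto <= rto_seconds and delivered_rpo <= rpo_seconds:
            return name
    raise ValueError("Targets tighter than even active-active can deliver")
```

Running this for a reporting service with a 1-hour RTO and 15-minute RPO lands on Pilot Light, matching the Tier 3 row of the table.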
3. DR Strategy Spectrum: Backup to Multi-Site
The AWS Well-Architected Framework defines four DR strategies arranged on a spectrum from cheapest/slowest to most expensive/fastest:
| Strategy | RPO | RTO | Cost Factor | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | Days | 1x | Low |
| Pilot Light | Minutes | Hours | 3–5x | Medium |
| Warm Standby | Seconds | Minutes | 7–10x | High |
| Multi-Site Active-Active | Near 0 | Near 0 | 20–30x | Very High |
4. AWS Backup: Centralized Backup Automation
AWS Backup is the control plane for all backup operations across AWS services — RDS, DynamoDB, EFS, EBS, S3, FSx, Aurora, DocumentDB, Neptune, EC2 AMIs, and VMware on AWS. Instead of configuring backups per service, define policies centrally.
Key AWS Backup Features
- Backup plan: Define backup rules (schedule, retention period, lifecycle to cold storage, cross-region copy rules).
- Backup vault: KMS-encrypted storage for recovery points. Vault Lock prevents backup deletion even by the root account (WORM compliance).
- Cross-region copy: Automatically copy backups to a DR region (e.g., us-east-1 → us-west-2). Costs $0.02/GB for cross-region data transfer.
- Cross-account backup: Copy to a dedicated backup account in your AWS Organization. This protects against ransomware or accidental deletion in the source account — the most critical protection often overlooked.
- Backup Audit Manager: Compliance reporting and evidence collection for PCI-DSS, HIPAA, SOC2 auditors.
- Lifecycle management: Move backups to S3 Glacier Instant Retrieval after 30 days ($0.004/GB/month), S3 Glacier Deep Archive after 90 days ($0.00099/GB/month).
```hcl
# Terraform: AWS Backup plan — daily backups, 30-day retention, cross-region copy
resource "aws_backup_plan" "production" {
  name = "prod-backup-plan"

  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 2 * * ? *)" # 2 AM UTC daily
    start_window      = 60                  # minutes
    completion_window = 180                 # minutes

    lifecycle {
      cold_storage_after = 30  # days
      delete_after       = 365 # days
    }

    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn

      lifecycle {
        cold_storage_after = 30
        delete_after       = 365
      }
    }
  }
}

resource "aws_backup_selection" "production" {
  name         = "prod-backup-selection"
  iam_role_arn = aws_iam_role.backup.arn
  plan_id      = aws_backup_plan.production.id

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "BackupEnabled"
    value = "true"
  }
}
```
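A backup plan is only half the job — you also need to notice when jobs fail. A minimal sketch of the alerting logic (the job list would come from boto3's `backup` client via `list_backup_jobs`, shown in the comment; the function name is our own):

```python
# Sketch: flag backup jobs that did not complete in the last reporting window.
# In practice the job list comes from AWS Backup, e.g.:
#   jobs = boto3.client("backup").list_backup_jobs(
#       ByCreatedAfter=yesterday
#   )["BackupJobs"]
def failed_backup_jobs(jobs: list) -> list:
    # Terminal non-success states reported by AWS Backup
    bad_states = {"FAILED", "ABORTED", "EXPIRED"}
    return [j["BackupJobId"] for j in jobs if j["State"] in bad_states]
```

Wire the result into an SNS notification or a CloudWatch metric so a silent backup failure never goes a full retention cycle unnoticed.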
5. Pilot Light DR Architecture
Pilot Light keeps the minimum viable DR infrastructure always running in the secondary region — just enough to bootstrap full capacity when disaster strikes. Think of it as a pilot flame: always burning at low cost, ready to ignite the full furnace.
Always Running in DR Region
- RDS read replica of production database (promotes to primary in 5–15 minutes)
- Route 53 failover records pointing to DR ALB (dormant, health check failing)
- VPC, subnets, security groups, IAM roles — network foundation in place
- ECR container images replicated or accessible from DR region
Stopped/Minimal in DR Region
- EC2 Auto Scaling groups: desired capacity = 0
- ECS services: desired tasks = 0
- ElastiCache: not running (restore from backup on DR activation)
Failover Procedure (SSM Automation)
```bash
# AWS CLI: Pilot Light DR failover steps

# Step 1: Promote the RDS read replica in the DR region (~5-15 min)
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region us-west-2

# Step 2: Scale up ECS services
aws ecs update-service \
  --cluster prod-cluster-dr \
  --service api-service \
  --desired-count 3 \
  --region us-west-2

# Step 3: Route 53 fails DNS over to the DR endpoint automatically
# once the primary's health check starts failing — no action needed here

# Step 4: Validate application health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/prod-tg-dr/abc123 \
  --region us-west-2
```
Total RTO breakdown: promote RDS (5–15 min) + ECS scale-out (5–10 min) + application warmup (2–5 min) = 12–30 minutes. Infrastructure as Code is non-negotiable for Pilot Light — the entire DR environment must be deployable with a single Terraform command.
6. Warm Standby: Always-On Reduced Capacity
Warm Standby runs the DR environment continuously at reduced capacity (smaller instance types, fewer tasks). Unlike Pilot Light, there's no cold start delay — the DR environment is always "warm" and can handle traffic immediately.
- Production: 10 × r6g.xlarge ECS tasks in us-east-1
- DR (normal): 2 × r6g.large ECS tasks in us-west-2 (scaled down, validating config works)
- DR (activated): Scale up to 10 × r6g.xlarge ECS tasks in us-west-2
Failover Timeline
Route 53 health check polls every 10 seconds with failure threshold 3 = 30 seconds to detect failure. DNS TTL: 60 seconds. ECS scale-out in DR: 2–3 minutes. Estimated total RTO: 3–5 minutes.
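The detection arithmetic above generalizes to any health-check configuration — a small sketch (function name is our own) for estimating the worst-case window before traffic shifts:

```python
# Worst-case time before clients reach the DR region after a primary failure:
# the health check must fail `failure_threshold` consecutive probes, then
# clients may keep cached DNS answers for up to one TTL.
def failover_detection_seconds(probe_interval_s: int,
                               failure_threshold: int,
                               dns_ttl_s: int) -> int:
    return probe_interval_s * failure_threshold + dns_ttl_s
```

With 10-second fast health checks, a threshold of 3, and a 60-second TTL, the window is 90 seconds — the remainder of the 3–5 minute RTO estimate is ECS scale-out and application warmup.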
Regularly route 5–10% of production traffic to the DR region as part of Warm Standby operation. This validates that DR actually handles production workloads correctly — not just that the infrastructure is running, but that it performs under load.
- Database: Aurora Global Database provides sub-second replication to DR region. On failover, promote the secondary to writer in <1 minute.
- Sessions: Stateless application + ElastiCache Global Datastore = no session loss on failover. Users stay logged in.
- Cost: DR region typically 20–30% of production cost (running at reduced capacity continuously).
7. Multi-Site Active-Active: Zero Downtime
Multi-Site Active-Active runs identical production deployments in multiple AWS regions simultaneously, serving traffic via Route 53 Latency-Based routing. When a region fails, Route 53 health checks stop routing traffic to it — within 30–60 seconds — with no manual intervention.
The Data Consistency Challenge
Multi-region active-active is architecturally complex because writes happening in multiple regions simultaneously can conflict. Strategies:
- DynamoDB Global Tables: Multi-region active-active with last-writer-wins conflict resolution. Sub-second replication. RPO ~0. Best choice for session data, user profiles, catalog.
- Aurora Global Database: One write region + up to 5 read regions. RPO ~1 second. Failover promotes a reader to writer in <1 minute. Best for relational data needing ACID guarantees.
- Region affinity: Route users to their "home" region via geolocation/cookie. All writes for a user stay in one region. Simplest consistency model, but doesn't eliminate cross-region reads.
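Last-writer-wins is easy to state but worth internalizing: the concurrent write from the other region is silently discarded. A toy sketch of the semantics (item shape and timestamps are hypothetical, not the DynamoDB wire format):

```python
# Illustration of last-writer-wins conflict resolution — the model DynamoDB
# Global Tables applies when the same item is written in two regions.
def resolve_last_writer_wins(item_a: dict, item_b: dict) -> dict:
    # The version with the later modification timestamp survives;
    # the other region's concurrent write is lost without error.
    if item_a["last_modified"] >= item_b["last_modified"]:
        return item_a
    return item_b
```

This is why LWW suits session and profile data (losing one concurrent update is tolerable) but not financial ledgers, where a discarded write is a correctness bug.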
```hcl
# Terraform: DynamoDB Global Table (multi-region active-active)
resource "aws_dynamodb_table" "sessions" {
  name             = "user-sessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "userId"
  stream_enabled   = true                 # streams are required for replicas
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "userId"
    type = "S"
  }

  replica {
    region_name = "us-east-1"
  }

  replica {
    region_name = "eu-west-1"
  }

  replica {
    region_name = "ap-southeast-1"
  }
}
```
8. Data Replication: S3, RDS, DynamoDB, EBS
S3 Cross-Region Replication (CRR)
Automatic replication of new objects to a DR region bucket. Enable versioning on both source and destination buckets. Use Replication Time Control (RTC) for SLA-backed replication: 99.99% of objects replicated within 15 minutes.
RDS Cross-Region Read Replica
Asynchronous replication to DR region. Typical replication lag <1 second, can spike under heavy write load. Promote to standalone primary on DR: aws rds promote-read-replica. After promotion, update the application's database connection string (Secrets Manager reference).
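After promotion, the application must pick up the new endpoint. A sketch of rewriting the Secrets Manager payload (the `host` field name and secret shape are assumptions about your secret format; the updated string would be pushed back with `put_secret_value` as noted in the comment):

```python
import json

# Sketch: repoint the DB secret at the promoted DR replica so the app
# reconnects to the new primary. In practice, push the result back with:
#   boto3.client("secretsmanager").put_secret_value(
#       SecretId="prod/db", SecretString=new_secret)
def point_secret_at_dr(secret_json: str, dr_endpoint: str) -> str:
    secret = json.loads(secret_json)
    secret["host"] = dr_endpoint  # assumed field name in the secret payload
    return json.dumps(secret)
```

Applications that cache the secret need a connection-pool recycle (or a restart) to pick up the change — budget that into the RTO.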
EBS Snapshot Cross-Region Copy
For EC2 instances, copy AMIs and EBS snapshots to the DR region via AWS Backup or manually. Launch EC2 instances from the copied AMI in the DR region during activation. Snapshots are incremental — only changed blocks are copied after the initial snapshot.
```hcl
# Terraform: S3 Cross-Region Replication
resource "aws_s3_bucket_replication_configuration" "crr" {
  role   = aws_iam_role.s3_replication.arn
  bucket = aws_s3_bucket.primary.id

  rule {
    id     = "full-replication"
    status = "Enabled"

    # Required alongside replication_time (V2 rule schema)
    filter {}
    delete_marker_replication {
      status = "Enabled"
    }

    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"

      replication_time {
        status = "Enabled"
        time {
          minutes = 15 # RTC - SLA-backed
        }
      }

      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }
  }
}
```
```hcl
# RDS: Create a cross-region read replica
resource "aws_db_instance" "dr_replica" {
  provider            = aws.us_west_2
  identifier          = "prod-db-dr"
  replicate_source_db = "arn:aws:rds:us-east-1:123456789012:db:prod-db"
  instance_class      = "db.r7g.large"
  publicly_accessible = false
  skip_final_snapshot = false
  deletion_protection = true

  tags = {
    Role        = "dr-replica"
    Environment = "dr"
  }
}
```
9. DR Testing: Game Days & Automated Runbooks
DR plans fail in production for one reason: they were never tested. Stale runbooks, dependency drift, team turnover, and changed application behavior all silently invalidate a DR plan over time.
Testing Cadence
- Monthly: Component-level tests — restore a single RDS snapshot, validate ElastiCache failover, test Lambda@Edge edge cases.
- Quarterly: Full DR drill — execute the complete failover procedure in a staging environment. Measure actual RTO vs target. Document gaps.
- Twice yearly: Game Day — full team exercise with production-like conditions. Rotate on-call, announce scope, execute failover, debrief.
AWS Fault Injection Simulator (FIS)
FIS injects controlled failures — AZ outages, EC2 termination, latency injection, API throttling — without manual scripting. Define experiments as code, set blast radius limits, and run in production to validate DR readiness under real traffic.
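An FIS experiment defined as code might look like the following Terraform sketch (field names follow the `aws_fis_experiment_template` resource; the role, alarm, and tag values are illustrative assumptions) — it terminates a single production-tagged instance, with a CloudWatch alarm as the blast-radius guardrail:

```hcl
# Sketch: FIS experiment — terminate one tagged instance to rehearse failure
resource "aws_fis_experiment_template" "terminate_instance" {
  description = "DR drill: terminate one production-tagged EC2 instance"
  role_arn    = aws_iam_role.fis.arn # assumed IAM role for FIS

  # Guardrail: stop the experiment if the error-rate alarm fires
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate.arn
  }

  action {
    name      = "terminate-one"
    action_id = "aws:ec2:terminate-instances"
    target {
      key   = "Instances"
      value = "tagged-instances"
    }
  }

  target {
    name           = "tagged-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "COUNT(1)" # blast radius: exactly one instance

    resource_tag {
      key   = "Environment"
      value = "production"
    }
  }
}
```

The `COUNT(1)` selection mode plus the stop condition are what make this safe enough to run against production, which is the whole point of FIS.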
```yaml
# AWS Systems Manager Automation: Pilot Light DR runbook
# Stored in SSM, versioned in Git, requires approval before execution
---
description: "Activate Pilot Light DR - Failover to us-west-2"
schemaVersion: "0.3"
assumeRole: "{{ AutomationAssumeRole }}"
parameters:
  AutomationAssumeRole:
    type: String
  PrimaryHealthCheckId:
    type: String
  NotificationArn:
    type: String
mainSteps:
  # Sanity-check the primary health check config before acting; a production
  # runbook should also confirm the health check is actually failing.
  - name: VerifyPrimaryFailure
    action: aws:assertAwsResourceProperty
    inputs:
      Service: route53
      Api: GetHealthCheck
      HealthCheckId: "{{ PrimaryHealthCheckId }}"
      PropertySelector: "$.HealthCheck.HealthCheckConfig.FailureThreshold"
      DesiredValues: ["3"]
  - name: PromoteRDSReplica
    action: aws:executeAwsApi
    inputs:
      Service: rds
      Api: PromoteReadReplica
      DBInstanceIdentifier: "prod-db-dr-replica"
  - name: WaitForRDSPromotion
    action: aws:waitForAwsResourceProperty
    timeoutSeconds: 900
    inputs:
      Service: rds
      Api: DescribeDBInstances
      DBInstanceIdentifier: "prod-db-dr-replica"
      PropertySelector: "$.DBInstances[0].DBInstanceStatus"
      DesiredValues: ["available"]
  - name: ScaleECSServices
    action: aws:executeAwsApi
    inputs:
      Service: ecs
      Api: UpdateService
      cluster: "prod-cluster-dr"
      service: "api-service"
      desiredCount: 3
  - name: SendNotification
    action: aws:executeAwsApi
    inputs:
      Service: sns
      Api: Publish
      TopicArn: "{{ NotificationArn }}"
      Message: "DR Activation complete. Primary: us-east-1 (FAILED). Active: us-west-2."
```
10. DR Cost Optimization & Checklist
Cost Optimization Strategies
- Move backups to S3 Glacier Instant Retrieval after 30 days (90% cost reduction vs S3 Standard)
- Auto Scaling desired capacity = 0 for compute in Pilot Light DR region
- Reserved Instances (1-year) for Warm Standby instances (saves 30–40% vs On-Demand)
- S3 Intelligent-Tiering for objects with unpredictable access patterns during DR periods
- Stop non-critical DR resources (ElastiCache, OpenSearch) until DR activation
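The lifecycle savings are easy to estimate — a back-of-envelope sketch using the Glacier prices quoted earlier (the warm-storage rate is an assumption for illustration; actual AWS Backup warm pricing varies by service):

```python
# Rough monthly backup storage cost in USD, using $/GB-month rates.
# Glacier rates are from the article; the warm rate is an assumption.
PRICES_PER_GB_MONTH = {
    "warm": 0.05,              # assumption: warm backup storage
    "glacier_instant": 0.004,  # after 30 days
    "deep_archive": 0.00099,   # after 90 days
}

def monthly_backup_cost(gb_by_tier: dict) -> float:
    return round(sum(PRICES_PER_GB_MONTH[tier] * gb
                     for tier, gb in gb_by_tier.items()), 2)
```

For example, keeping 10 TB in Deep Archive instead of warm storage drops that slice of the bill from roughly $500 to under $10 per month — which is why aggressive lifecycle rules are the first DR cost lever to pull.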
Planning & Architecture
- ✅ RTO/RPO targets defined and signed off by business stakeholders
- ✅ Application services tiered (Tier 1–4) with DR strategy per tier
- ✅ DR region selected (ideally diagonally opposite: us-east-1 → us-west-2 or eu-west-1)
- ✅ AWS Organizations structure: dedicated backup account for cross-account backup
- ✅ Compliance requirements documented (PCI-DSS annual DR test, HIPAA contingency plan)
Data Backup
- ✅ AWS Backup plan: daily, 30-day retention, cross-region copy, cross-account copy
- ✅ Backup Vault Lock enabled for ransomware protection (WORM)
- ✅ S3 CRR enabled with RTC for critical buckets
- ✅ RDS read replica in DR region (for Pilot Light and Warm Standby)
- ✅ DynamoDB Global Tables for session/profile data
- ✅ EC2 AMIs copied to DR region for key instances
Failover Procedures
- ✅ SSM Automation runbook: failover procedure as code, versioned in Git
- ✅ Route 53 health checks configured; failover records in place
- ✅ DNS TTL on failover records: 60 seconds
- ✅ Secrets Manager secrets replicated to DR region
- ✅ Container images in ECR replicated or accessible cross-region
Testing & Documentation
- ✅ Restore test quarterly: actually restore RDS snapshot to DR region and validate
- ✅ Full DR drill quarterly: execute failover in staging, measure actual RTO/RPO
- ✅ Game day twice yearly: full team, production-like conditions
- ✅ AWS FIS experiments: AZ failure, instance termination, latency injection
- ✅ Runbook reviewed and updated after every DR test
- ✅ AWS Resilience Hub: resiliency score tracked per application
Common DR Anti-Patterns
- No documented runbooks — "we'll figure it out when it happens" always fails
- Single-region database with no read replica — a regional outage means complete data inaccessibility
- Backups in the same AWS account as production — ransomware deletes both simultaneously
- Untested backups — a backup that has never been restored is not a backup
- Manual failover steps — human error under stress adds 30–60+ minutes to RTO
- Using the same Terraform workspace for production and DR — changes can accidentally modify both environments