
AWS Disaster Recovery & Backup Strategy: RTO/RPO, Cross-Region & Restore Testing

Disaster recovery planning is the engineering work nobody wants to do — until the day the primary region goes down and the team scrambles for a runbook that nobody tested. This guide provides a complete, opinionated DR framework for AWS: from setting RTO/RPO targets and choosing the right strategy to automating backups, cross-region replication, and running game days.

Md Sanwar Hossain · April 7, 2026 · 23 min read · AWS Reliability

TL;DR

"Match your DR strategy to your RTO/RPO budget: Backup & Restore for RPO hours/RTO days (cheapest), Pilot Light for RPO minutes/RTO hours (moderate), Warm Standby for RPO seconds/RTO minutes (higher cost), Multi-Site Active-Active for RPO~0/RTO~0 (highest cost). Automate backups with AWS Backup, test restores quarterly, and run game days twice yearly — untested DR plans fail when you need them most."

Table of Contents

  1. Why Disaster Recovery Planning Before Production is Non-Negotiable
  2. RTO & RPO: Setting Realistic Targets
  3. DR Strategy Spectrum: Backup to Multi-Site
  4. AWS Backup: Centralized Backup Automation
  5. Pilot Light DR Architecture
  6. Warm Standby: Always-On Reduced Capacity
  7. Multi-Site Active-Active: Zero Downtime
  8. Data Replication: S3, RDS, DynamoDB, EBS
  9. DR Testing: Game Days & Automated Runbooks
  10. DR Cost Optimization & Checklist

1. Why Disaster Recovery Planning Before Production is Non-Negotiable

An often-repeated industry estimate is that around 40% of businesses don't reopen after a major data loss event, and downtime in financial services is commonly quoted at $9,000–$17,000 per minute. Yet DR planning is routinely deferred until "after launch", which means it never happens before the first crisis.

The AWS Shared Responsibility Model for DR

AWS provides infrastructure availability (AZ redundancy, hardware fault tolerance, global network). You are responsible for your application's disaster recovery: data backup, cross-region replication, failover automation, and restore testing. AWS going down doesn't excuse your SLA violation.

DR vs High Availability

High Availability (HA) = redundancy within a region (Multi-AZ RDS, ALB across AZs, Auto Scaling). HA handles hardware failures and AZ outages automatically without human intervention.

Disaster Recovery (DR) = the plan for when an entire region fails, data is corrupted, or a catastrophic event makes the primary environment unusable. DR requires explicit design, automation, and regular testing.

2. RTO & RPO: Setting Realistic Targets

RTO (Recovery Time Objective) is the maximum acceptable downtime after a disaster. RPO (Recovery Point Objective) is the maximum acceptable data loss, measured in time. Both are business decisions, not engineering defaults — get stakeholder sign-off with explicit cost implications.

Business Tier | RTO | RPO | DR Strategy | Approx. Monthly Cost
Tier 1 Critical (payments, auth) | < 1 min | Near 0 | Multi-Site Active-Active | $10,000+
Tier 2 Important (API, core services) | < 15 min | < 1 min | Warm Standby | $2,000–5,000
Tier 3 Standard (reporting, analytics) | < 1 hour | < 15 min | Pilot Light | $500–2,000
Tier 4 Low Priority (batch, dev tools) | < 24 hours | < 4 hours | Backup & Restore | $50–200

A useful rule of thumb: each order-of-magnitude improvement in RTO (hours → minutes → seconds) multiplies the cost of DR infrastructure by roughly 3–10×. Map your services to tiers, get business sign-off on the cost implications, and don't over-engineer DR for Tier 4 systems.
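The tier table can be turned into a small lookup that picks the cheapest strategy whose bounds still fit a service's targets. This is only a sketch: the names `Tier`, `TIERS`, and `cheapest_tier` are hypothetical, "Near 0" RPO is treated as 0 seconds, and the table's strict "<" bounds are simplified to "<=".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    strategy: str
    rto_seconds: int  # upper bound on recovery time, taken from the tier table
    rpo_seconds: int  # upper bound on data loss, taken from the tier table


# Ordered cheapest first, so the first match is the least expensive fit.
TIERS = [
    Tier("Tier 4", "Backup & Restore", 24 * 3600, 4 * 3600),
    Tier("Tier 3", "Pilot Light", 3600, 15 * 60),
    Tier("Tier 2", "Warm Standby", 15 * 60, 60),
    Tier("Tier 1", "Multi-Site Active-Active", 60, 0),  # "Near 0" assumed to be 0 s
]


def cheapest_tier(rto_target_s: int, rpo_target_s: int) -> Tier:
    """Return the cheapest tier whose RTO and RPO bounds fit inside the targets."""
    for tier in TIERS:
        if tier.rto_seconds <= rto_target_s and tier.rpo_seconds <= rpo_target_s:
            return tier
    # Nothing cheaper fits: fall back to the strongest (and most expensive) tier.
    return TIERS[-1]


if __name__ == "__main__":
    # A reporting service that tolerates 2 hours of downtime and 30 minutes of data loss.
    print(cheapest_tier(2 * 3600, 30 * 60).strategy)
```

Running the lookup per service before the stakeholder review makes it easier to spot a Tier 4 system that is about to be given Tier 2 infrastructure.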

3. DR Strategy Spectrum: Backup to Multi-Site

The AWS Well-Architected Framework defines four DR strategies arranged on a spectrum from cheapest/slowest to most expensive/fastest:

AWS DR Strategy Spectrum — Backup & Restore through Multi-Site Active-Active, trading cost for RTO/RPO. Source: mdsanwarhossain.me
Strategy | RPO | RTO | Cost Factor | Complexity
Backup & Restore | Hours | Days | 1x | Low
Pilot Light | Minutes | Hours | 3–5x | Medium
Warm Standby | Seconds | Minutes | 7–10x | High
Multi-Site Active-Active | Near 0 | Near 0 | 20–30x | Very High

4. AWS Backup: Centralized Backup Automation

AWS Backup is a central control plane for backup operations across the AWS services it supports, including RDS, DynamoDB, EFS, EBS, S3, FSx, Aurora, DocumentDB, Neptune, EC2 (AMIs), and VMware workloads. Instead of configuring backups per service, define policies centrally.

Key AWS Backup Features

# Terraform: AWS Backup plan with daily backups, cold storage after 30 days, deletion after 365 days, cross-region copy
resource "aws_backup_plan" "production" {
  name = "prod-backup-plan"

  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 2 * * ? *)"  # 2 AM UTC daily
    start_window      = 60
    completion_window = 180

    lifecycle {
      cold_storage_after = 30  # days
      delete_after       = 365 # days
    }

    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn
      lifecycle {
        cold_storage_after = 30
        delete_after       = 365
      }
    }
  }
}

resource "aws_backup_selection" "production" {
  name         = "prod-backup-selection"
  iam_role_arn = aws_iam_role.backup.arn
  plan_id      = aws_backup_plan.production.id

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "BackupEnabled"
    value = "true"
  }
}
AWS Backup Centralized Architecture — primary account backups copied cross-region and cross-account. Source: mdsanwarhossain.me
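Lifecycle values like the ones in the plan above are easy to get wrong, because AWS Backup requires `delete_after` to be at least 90 days greater than `cold_storage_after`. The check below is a minimal sketch of that rule for use in a pre-merge script; the function name `lifecycle_problems` is hypothetical and it covers only this one constraint.

```python
from typing import List, Optional

# Recovery points moved to cold storage must stay there for at least 90 days.
MIN_COLD_STORAGE_DAYS = 90


def lifecycle_problems(cold_storage_after: Optional[int],
                       delete_after: Optional[int]) -> List[str]:
    """Return human-readable problems; an empty list means the pair looks valid."""
    problems = []
    if cold_storage_after is not None and delete_after is not None:
        minimum = cold_storage_after + MIN_COLD_STORAGE_DAYS
        if delete_after < minimum:
            problems.append(
                f"delete_after={delete_after} must be >= {minimum} "
                f"(cold_storage_after + {MIN_COLD_STORAGE_DAYS})"
            )
    return problems


if __name__ == "__main__":
    print(lifecycle_problems(30, 365))  # the values used in the plan above
    print(lifecycle_problems(30, 60))   # too short once cold storage is enabled
```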

5. Pilot Light DR Architecture

Pilot Light keeps the minimum viable DR infrastructure always running in the secondary region — just enough to bootstrap full capacity when disaster strikes. Think of it as a pilot flame: always burning at low cost, ready to ignite the full furnace.

Always Running in DR Region

Stopped/Minimal in DR Region

Failover Procedure (SSM Automation)

# AWS CLI: Pilot Light DR Failover Steps

# Step 1: Promote RDS read replica in DR region (~5-15 min)
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region us-west-2

# Step 2: Scale up ECS services
aws ecs update-service \
  --cluster prod-cluster-dr \
  --service api-service \
  --desired-count 3 \
  --region us-west-2

# Step 3: Shift DNS to the DR endpoint
# (Route 53 failover routing does this automatically once the primary fails its
#  health check; no manual record change is needed if failover records are in place)

# Step 4: Validate application health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/prod-tg-dr/abc123 \
  --region us-west-2

Total RTO breakdown: promote RDS (5–15 min) + ECS scale-out (5–10 min) + application warmup (2–5 min) = 12–30 minutes. Infrastructure as Code is non-negotiable for Pilot Light — the entire DR environment must be deployable with a single Terraform command.
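The breakdown above is a sum of per-phase ranges, which is worth keeping as data so it can be re-checked whenever a phase changes. A minimal sketch, assuming the phases run strictly one after another; `PHASES_MINUTES`, `rto_range`, and `meets_target` are hypothetical names.

```python
from typing import Dict, Tuple

# (best case, worst case) in minutes, copied from the breakdown above.
PHASES_MINUTES: Dict[str, Tuple[int, int]] = {
    "promote RDS read replica": (5, 15),
    "ECS scale-out": (5, 10),
    "application warmup": (2, 5),
}


def rto_range(phases: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum best-case and worst-case durations, assuming phases run sequentially."""
    best = sum(low for low, _ in phases.values())
    worst = sum(high for _, high in phases.values())
    return best, worst


def meets_target(phases: Dict[str, Tuple[int, int]], target_minutes: int) -> bool:
    """Judge the plan by its worst case, since that is what an RTO must cover."""
    return rto_range(phases)[1] <= target_minutes


if __name__ == "__main__":
    print(rto_range(PHASES_MINUTES))         # (12, 30)
    print(meets_target(PHASES_MINUTES, 60))  # Tier 3 target of under 1 hour
```

Comparing the worst case rather than the midpoint is deliberate: an RTO is a ceiling, so a plan that only meets it on a good day does not meet it.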

6. Warm Standby: Always-On Reduced Capacity

Warm Standby runs the DR environment continuously at reduced capacity (smaller instance types, fewer tasks). Unlike Pilot Light, there's no cold start delay — the DR environment is always "warm" and can handle traffic immediately.

Failover Timeline

Route 53 health check polls every 10 seconds with failure threshold 3 = 30 seconds to detect failure. DNS TTL: 60 seconds. ECS scale-out in DR: 2–3 minutes. Estimated total RTO: 3–5 minutes.
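The timeline above can be reproduced from its inputs, which makes it easy to see what a slower health check interval would cost. This is a sketch under the assumption that detection, DNS TTL expiry, and scale-out happen back to back (the pessimistic case); the function names are hypothetical.

```python
from typing import Tuple


def detection_seconds(request_interval_s: int, failure_threshold: int) -> int:
    """Time for a health check to be marked unhealthy: interval times consecutive failures."""
    return request_interval_s * failure_threshold


def warm_standby_rto_seconds(request_interval_s: int = 10,
                             failure_threshold: int = 3,
                             dns_ttl_s: int = 60,
                             scale_out_s: Tuple[int, int] = (120, 180)) -> Tuple[int, int]:
    """Return (best, worst) RTO in seconds for detect + TTL expiry + scale-out."""
    fixed = detection_seconds(request_interval_s, failure_threshold) + dns_ttl_s
    return fixed + scale_out_s[0], fixed + scale_out_s[1]


if __name__ == "__main__":
    low, high = warm_standby_rto_seconds()
    print(f"{low / 60:.1f} to {high / 60:.1f} minutes")
    # The standard 30-second interval adds a full minute of detection time.
    print(warm_standby_rto_seconds(request_interval_s=30))
```

With the defaults from the text this gives 210 to 270 seconds, which sits inside the 3–5 minute estimate.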

💡 Game Day Practice

Regularly route 5–10% of production traffic to the DR region as part of Warm Standby operation. This validates that DR actually handles production workloads correctly — not just that the infrastructure is running, but that it performs under load.
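One way to send a fixed share of traffic to the DR region is weighted DNS records, where each record receives traffic in proportion to its weight divided by the sum of all weights. The helper below is a hedged sketch of that arithmetic only; using weighted routing for this purpose is an assumption, and `weighted_records` is a hypothetical name.

```python
from typing import Dict


def weighted_records(dr_percent: float, total_weight: int = 100) -> Dict[str, int]:
    """Split an integer weight budget so the DR record gets roughly dr_percent of traffic."""
    if not 0 <= dr_percent <= 100:
        raise ValueError("dr_percent must be between 0 and 100")
    dr_weight = round(total_weight * dr_percent / 100)
    return {"primary": total_weight - dr_weight, "dr": dr_weight}


if __name__ == "__main__":
    print(weighted_records(10))  # {'primary': 90, 'dr': 10}
    print(weighted_records(5))   # {'primary': 95, 'dr': 5}
```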

7. Multi-Site Active-Active: Zero Downtime

Multi-Site Active-Active runs identical production deployments in multiple AWS regions simultaneously, serving traffic via Route 53 Latency-Based routing. When a region fails, Route 53 health checks stop routing traffic to it — within 30–60 seconds — with no manual intervention.

The Data Consistency Challenge

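Multi-region active-active is architecturally complex because writes happening in multiple regions simultaneously can conflict. The example below shows one strategy: a DynamoDB Global Table, which accepts writes in every replica region and resolves concurrent updates to the same item with last-writer-wins.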

# Terraform: DynamoDB Global Table (multi-region active-active)
resource "aws_dynamodb_table" "sessions" {
  name             = "user-sessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "userId"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "userId"
    type = "S"
  }

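  # The table itself is created in the provider's region (assumed to be us-east-1 here),
  # so only the additional regions are declared as replicas.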

  replica {
    region_name = "eu-west-1"
  }

  replica {
    region_name = "ap-southeast-1"
  }
}
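DynamoDB Global Tables reconcile concurrent writes to the same item with a last-writer-wins rule. The sketch below illustrates that rule in isolation so its main consequence is visible: the older write is silently discarded. It is not DynamoDB's implementation; the `Version` type and the region-name tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    written_at_ms: int  # write timestamp in milliseconds
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the newer write; break exact timestamp ties by region name (assumed rule)."""
    return max(a, b, key=lambda v: (v.written_at_ms, v.region))


if __name__ == "__main__":
    east = Version("cart=3", 1_000, "us-east-1")
    west = Version("cart=5", 1_005, "eu-west-1")
    # The us-east-1 update is lost even though both writes were accepted locally.
    print(last_writer_wins(east, west))
```

Data that cannot tolerate a lost update (balances, inventory counts) usually needs a different approach, such as routing all writes for a given key to a single region.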
AWS RDS DR Architecture — Multi-AZ primary, cross-region read replica, automated backups with S3 replication for RPO and RTO targets. Source: mdsanwarhossain.me

8. Data Replication: S3, RDS, DynamoDB, EBS

S3 Cross-Region Replication (CRR)

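Automatic replication of new objects to a DR region bucket. Enable versioning on both source and destination buckets. Use Replication Time Control (RTC) when you need a predictable replication window: it is designed to replicate 99.99% of new objects within 15 minutes, and the accompanying SLA commits to 99.9%. Objects that existed before replication was enabled are not copied automatically and need S3 Batch Replication.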

RDS Cross-Region Read Replica

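Asynchronous replication to the DR region. Replication lag is typically low (often around a second or less) but can spike under heavy write load, so monitor the ReplicaLag CloudWatch metric against your RPO. To fail over, promote the replica to a standalone primary with aws rds promote-read-replica. After promotion, update the application's database connection string (for example, the Secrets Manager secret it references).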
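Because replication is asynchronous, the replica's lag at the moment of failure is the data you lose, so it helps to compare observed lag against the RPO directly. A minimal sketch, assuming the lag samples (in seconds) have already been collected from monitoring; `rpo_breaches` and `worst_case_rpo` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def rpo_breaches(lag_samples_s: Sequence[float], rpo_s: float) -> List[Tuple[int, float]]:
    """Return (sample index, lag) for every sample whose lag exceeds the RPO."""
    return [(i, lag) for i, lag in enumerate(lag_samples_s) if lag > rpo_s]


def worst_case_rpo(lag_samples_s: Sequence[float]) -> float:
    """The largest observed lag is the data loss a failure at that moment would have caused."""
    return max(lag_samples_s)


if __name__ == "__main__":
    samples = [0.4, 0.8, 95.0, 1.2]  # illustrative values, including one spike
    print(rpo_breaches(samples, rpo_s=60))
    print(worst_case_rpo(samples))
```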

EBS Snapshot Cross-Region Copy

For EC2 instances, copy AMIs and EBS snapshots to the DR region via AWS Backup or manually. Launch EC2 instances from the copied AMI in the DR region during activation. Snapshots are incremental — only changed blocks are copied after the initial snapshot.

# Terraform: S3 Cross-Region Replication
resource "aws_s3_bucket_replication_configuration" "crr" {
  role   = aws_iam_role.s3_replication.arn
  bucket = aws_s3_bucket.primary.id

  rule {
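    id     = "full-replication"
    status = "Enabled"

    # An empty filter matches every object and selects the current (V2) rule schema,
    # which in turn requires an explicit delete_marker_replication setting.
    # "Disabled" is an assumption here: deletes in the primary are not mirrored to DR.
    filter {}

    delete_marker_replication {
      status = "Disabled"
    }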

    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"

      replication_time {
        status = "Enabled"
        time   { minutes = 15 }  # RTC - SLA-backed
      }
      metrics {
        status = "Enabled"
        event_threshold { minutes = 15 }
      }
    }
  }
}

# RDS: Create cross-region read replica
resource "aws_db_instance" "dr_replica" {
  provider               = aws.us_west_2
  identifier             = "prod-db-dr-replica"  # same identifier the failover steps promote
  replicate_source_db    = "arn:aws:rds:us-east-1:123456789012:db:prod-db"
  instance_class         = "db.r7g.large"
  publicly_accessible    = false
  skip_final_snapshot    = false
  deletion_protection    = true
  tags = { Role = "dr-replica", Environment = "dr" }
}

9. DR Testing: Game Days & Automated Runbooks

DR plans fail in production for one reason: they were never tested. Stale runbooks, dependency drift, team turnover, and changed application behavior all silently invalidate a DR plan over time.
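A drill is only useful if it produces numbers that can be compared with the targets. The sketch below derives measured RTO and RPO from three timestamps recorded during a drill; the function names and the example times are hypothetical, and it assumes the timestamps come from a single, consistent clock.

```python
from datetime import datetime, timedelta
from typing import Tuple


def measure_drill(failure_at: datetime,
                  last_replicated_write_at: datetime,
                  service_restored_at: datetime) -> Tuple[timedelta, timedelta]:
    """Return (measured RTO, measured RPO) for one drill."""
    rto = service_restored_at - failure_at        # how long users were down
    rpo = failure_at - last_replicated_write_at   # how much recent data was missing
    return rto, rpo


def drill_passed(measured: Tuple[timedelta, timedelta],
                 target_rto: timedelta,
                 target_rpo: timedelta) -> bool:
    """A drill passes only if both measurements are within their targets."""
    rto, rpo = measured
    return rto <= target_rto and rpo <= target_rpo


if __name__ == "__main__":
    result = measure_drill(
        failure_at=datetime(2026, 4, 7, 10, 0, 0),
        last_replicated_write_at=datetime(2026, 4, 7, 9, 59, 30),
        service_restored_at=datetime(2026, 4, 7, 10, 22, 0),
    )
    print(result)  # 22 minutes of downtime, 30 seconds of data loss
    print(drill_passed(result, timedelta(hours=1), timedelta(minutes=15)))
```

Recording these values after every drill gives a trend line, which tends to reveal runbook drift earlier than a single pass or fail result.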

Testing Cadence

AWS Fault Injection Service (FIS)

FIS (formerly AWS Fault Injection Simulator) injects controlled failures such as AZ outages, EC2 termination, latency injection, and API throttling, without manual scripting. Define experiments as code, limit the blast radius with target filters and stop conditions, and run them in production to validate DR readiness under real traffic.

# AWS Systems Manager Automation: Pilot Light DR Runbook
# Stored in SSM, versioned in Git, requires approval before execution
---
description: "Activate Pilot Light DR - Failover to us-west-2"
schemaVersion: "0.3"
assumeRole: "{{ AutomationAssumeRole }}"

parameters:
  AutomationAssumeRole:
    type: String
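  NotificationArn:
    type: String
  PrimaryHealthCheckId:
    type: String
    description: "Route 53 health check ID for the primary endpoint"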

mainSteps:
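  # Records what a Route 53 health checker currently reports for the primary, so the
  # execution log shows why failover was triggered. It does not gate the run by itself.
  - name: CheckPrimaryHealth
    action: aws:executeAwsApi
    inputs:
      Service: route53
      Api: GetHealthCheckStatus
      HealthCheckId: "{{ PrimaryHealthCheckId }}"
    outputs:
      - Name: PrimaryStatus
        Selector: "$.HealthCheckObservations[0].StatusReport.Status"
        Type: String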

  - name: PromoteRDSReplica
    action: aws:executeAwsApi
    inputs:
      Service: rds
      Api: PromoteReadReplica
      DBInstanceIdentifier: "prod-db-dr-replica"

  - name: WaitForRDSPromotion
    action: aws:waitForAwsResourceProperty
    inputs:
      Service: rds
      Api: DescribeDBInstances
      DBInstanceIdentifier: "prod-db-dr-replica"
      PropertySelector: "$.DBInstances[0].DBInstanceStatus"
      DesiredValues: ["available"]
    timeoutSeconds: 900

  - name: ScaleECSServices
    action: aws:executeAwsApi
    inputs:
      Service: ecs
      Api: UpdateService
      cluster: "prod-cluster-dr"
      service: "api-service"
      desiredCount: 3

  - name: SendNotification
    action: aws:executeAwsApi
    inputs:
      Service: sns
      Api: Publish
      TopicArn: "{{ NotificationArn }}"
      Message: "DR Activation complete. Primary: us-east-1 (FAILED). Active: us-west-2."

10. DR Cost Optimization & Checklist

Cost Optimization Strategies

Planning & Architecture

  • ✅ RTO/RPO targets defined and signed off by business stakeholders
  • ✅ Application services tiered (Tier 1–4) with DR strategy per tier
  • ✅ DR region selected (ideally geographically distant from the primary: us-east-1 → us-west-2 or eu-west-1)
  • ✅ AWS Organizations structure: dedicated backup account for cross-account backup
  • ✅ Compliance requirements documented (PCI-DSS annual DR test, HIPAA contingency plan)

Data Backup

  • ✅ AWS Backup plan: daily, 30-day retention, cross-region copy, cross-account copy
  • ✅ Backup Vault Lock enabled for ransomware protection (WORM)
  • ✅ S3 CRR enabled with RTC for critical buckets
  • ✅ RDS read replica in DR region (for Pilot Light and Warm Standby)
  • ✅ DynamoDB Global Tables for session/profile data
  • ✅ EC2 AMIs copied to DR region for key instances

Failover Procedures

  • ✅ SSM Automation runbook: failover procedure as code, versioned in Git
  • ✅ Route 53 health checks configured; failover records in place
  • ✅ DNS TTL on failover records: 60 seconds
  • ✅ Secrets Manager secrets replicated to DR region
  • ✅ Container images in ECR replicated or accessible cross-region

Testing & Documentation

  • ✅ Restore test quarterly: actually restore RDS snapshot to DR region and validate
  • ✅ Full DR drill quarterly: execute failover in staging, measure actual RTO/RPO
  • ✅ Game day twice yearly: full team, production-like conditions
  • ✅ AWS FIS experiments: AZ failure, instance termination, latency injection
  • ✅ Runbook reviewed and updated after every DR test
  • ✅ AWS Resilience Hub: resiliency score tracked per application

⚠️ Common Anti-Patterns
  • No documented runbooks — "we'll figure it out when it happens" always fails
  • Single-region database with no read replica — a regional outage means complete data inaccessibility
  • Backups in the same AWS account as production — ransomware deletes both simultaneously
  • Untested backups — a backup that has never been restored is not a backup
  • Manual failover steps — human error under stress adds 30–60+ minutes to RTO
  • Using the same Terraform workspace for production and DR — changes can accidentally modify both environments

Last updated: April 7, 2026