AWS Disaster Recovery & Backup Strategy: RTO/RPO, Cross-Region & Restore Testing
Disaster recovery planning is the engineering work nobody wants to do — until the day the primary region goes down and the team scrambles for a runbook that nobody tested. This guide provides a complete, opinionated DR framework for AWS: from setting RTO/RPO targets and choosing the right strategy to automating backups, cross-region replication, and running game days.
TL;DR
Match your DR strategy to your RTO/RPO budget: Backup & Restore for RPO hours/RTO days (cheapest), Pilot Light for RPO minutes/RTO hours (moderate), Warm Standby for RPO seconds/RTO minutes (higher cost), Multi-Site Active-Active for RPO~0/RTO~0 (highest cost). Automate backups with AWS Backup, test restores quarterly, and run game days twice yearly — untested DR plans fail when you need them most.
Table of Contents
- Why Disaster Recovery Planning Before Production is Non-Negotiable
- RTO & RPO: Setting Realistic Targets
- DR Strategy Spectrum: Backup to Multi-Site
- AWS Backup: Centralized Backup Automation
- Pilot Light DR Architecture
- Warm Standby: Always-On Reduced Capacity
- Multi-Site Active-Active: Zero Downtime
- Data Replication: S3, RDS, DynamoDB, EBS
- DR Testing: Game Days & Automated Runbooks
- DR Cost Optimization & Checklist
1. Why Disaster Recovery Planning Before Production is Non-Negotiable
An oft-cited industry statistic holds that 40% of businesses never reopen after a major data loss event, and for financial services, average downtime costs run $9,000–$17,000 per minute. Yet DR planning is routinely deferred until "after launch" — which means it never happens before the first crisis.
The AWS Shared Responsibility Model for DR
AWS provides infrastructure availability (AZ redundancy, hardware fault tolerance, global network). You are responsible for your application's disaster recovery: data backup, cross-region replication, failover automation, and restore testing. AWS going down doesn't excuse your SLA violation.
- AZ failure: Rare but documented (us-east-1 2012, 2019). Multi-AZ deployment handles this — but it's not DR.
- Regional failure: Extremely rare but not impossible (the us-east-1 cascading Kinesis failure in November 2020 affected multiple services simultaneously). Only cross-region DR handles this.
- Data corruption: Accidental DELETE, bug that overwrites data. Backups with point-in-time recovery are your only defense.
- Ransomware / accidental deletion: Same-account backups are vulnerable. Cross-account backup to a dedicated backup account is required for resilience.
DR vs High Availability
High Availability (HA) = redundancy within a region (Multi-AZ RDS, ALB across AZs, Auto Scaling). HA handles hardware failures and AZ outages automatically without human intervention.
Disaster Recovery (DR) = the plan for when an entire region fails, data is corrupted, or a catastrophic event makes the primary environment unusable. DR requires explicit design, automation, and regular testing.
2. RTO & RPO: Setting Realistic Targets
RTO (Recovery Time Objective) is the maximum acceptable downtime after a disaster. RPO (Recovery Point Objective) is the maximum acceptable data loss, measured in time. Both are business decisions, not engineering defaults — get stakeholder sign-off with explicit cost implications.
| Business Tier | RTO | RPO | DR Strategy | Approx. Monthly Cost |
|---|---|---|---|---|
| Tier 1 Critical (payments, auth) | < 1 min | Near 0 | Multi-Site Active-Active | $10,000+ |
| Tier 2 Important (API, core services) | < 15 min | < 1 min | Warm Standby | $2,000–5,000 |
| Tier 3 Standard (reporting, analytics) | < 1 hour | < 15 min | Pilot Light | $500–2,000 |
| Tier 4 Low Priority (batch, dev tools) | < 24 hours | < 4 hours | Backup & Restore | $50–200 |
A useful rule of thumb: each order-of-magnitude improvement in RTO (hours → minutes → seconds) roughly multiplies the cost of DR infrastructure by 3–10×. Map your services to tiers, get business sign-off on the cost implications, and don't over-engineer DR for Tier 4 systems.
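The tier table above can be encoded as a simple lookup — a minimal Python sketch (thresholds taken from the table; the function name and second-granularity are illustrative) that picks the cheapest strategy satisfying a given RTO/RPO requirement:

```python
# Map an RTO/RPO requirement (in seconds) to the cheapest DR strategy
# that can deliver it, per the tier table above.
def choose_dr_strategy(rto_seconds: int, rpo_seconds: int) -> str:
    # (strategy, RTO it delivers, RPO it delivers), cheapest first
    strategies = [
        ("Backup & Restore", 24 * 3600, 4 * 3600),       # Tier 4
        ("Pilot Light", 3600, 15 * 60),                  # Tier 3
        ("Warm Standby", 15 * 60, 60),                   # Tier 2
        ("Multi-Site Active-Active", 60, 1),             # Tier 1
    ]
    # Pick the first (cheapest) strategy whose delivered RTO/RPO
    # are at least as tight as the required targets.
    for name, delivered_rto, delivered_rpo in strategies:
        if delivered_rto <= rto_seconds and delivered_rpo <= rpo_seconds:
            return name
    raise ValueError("Targets tighter than even active-active can deliver")
```

Running this for a reporting service with a 1-hour RTO and 15-minute RPO lands on Pilot Light, matching the Tier 3 row of the table.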
3. DR Strategy Spectrum: Backup to Multi-Site
The AWS Well-Architected Framework defines four DR strategies arranged on a spectrum from cheapest/slowest to most expensive/fastest:
| Strategy | RPO | RTO | Cost Factor | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | Days | 1x | Low |
| Pilot Light | Minutes | Hours | 3–5x | Medium |
| Warm Standby | Seconds | Minutes | 7–10x | High |
| Multi-Site Active-Active | Near 0 | Near 0 | 20–30x | Very High |
4. AWS Backup: Centralized Backup Automation
AWS Backup is the control plane for all backup operations across AWS services — RDS, DynamoDB, EFS, EBS, S3, FSx, Aurora, DocumentDB, Neptune, EC2 AMIs, and VMware on AWS. Instead of configuring backups per service, define policies centrally.
Key AWS Backup Features
- Backup plan: Define backup rules (schedule, retention period, lifecycle to cold storage, cross-region copy rules).
- Backup vault: KMS-encrypted storage for recovery points. Vault Lock prevents backup deletion even by the root account (WORM compliance).
- Cross-region copy: Automatically copy backups to a DR region (e.g., us-east-1 → us-west-2). Costs $0.02/GB for cross-region data transfer.
- Cross-account backup: Copy to a dedicated backup account in your AWS Organization. This protects against ransomware or accidental deletion in the source account — the most critical protection often overlooked.
- Backup Audit Manager: Compliance reporting and evidence collection for PCI-DSS, HIPAA, SOC2 auditors.
- Lifecycle management: Move backups to S3 Glacier Instant Retrieval after 30 days ($0.004/GB/month), S3 Glacier Deep Archive after 90 days ($0.00099/GB/month).
```hcl
# Terraform: AWS Backup plan — daily backups, 30-day retention, cross-region copy
resource "aws_backup_plan" "production" {
  name = "prod-backup-plan"

  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 2 * * ? *)" # 2 AM UTC daily
    start_window      = 60                  # minutes
    completion_window = 180                 # minutes

    lifecycle {
      cold_storage_after = 30  # days
      delete_after       = 365 # days
    }

    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn

      lifecycle {
        cold_storage_after = 30
        delete_after       = 365
      }
    }
  }
}

resource "aws_backup_selection" "production" {
  name         = "prod-backup-selection"
  iam_role_arn = aws_iam_role.backup.arn
  plan_id      = aws_backup_plan.production.id

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "BackupEnabled"
    value = "true"
  }
}
```
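A backup plan is only half the job — you also need to notice when jobs fail. A minimal sketch of the alerting logic (the job list would come from boto3's `backup` client via `list_backup_jobs`, shown in the comment; the function name is our own):

```python
# Sketch: flag backup jobs that did not complete in the last reporting window.
# In practice the job list comes from AWS Backup, e.g.:
#   jobs = boto3.client("backup").list_backup_jobs(
#       ByCreatedAfter=yesterday
#   )["BackupJobs"]
def failed_backup_jobs(jobs: list) -> list:
    # Terminal non-success states reported by AWS Backup
    bad_states = {"FAILED", "ABORTED", "EXPIRED"}
    return [j["BackupJobId"] for j in jobs if j["State"] in bad_states]
```

Wire the result into an SNS notification or a CloudWatch metric so a silent backup failure never goes a full retention cycle unnoticed.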
5. Pilot Light DR Architecture
Pilot Light keeps the minimum viable DR infrastructure always running in the secondary region — just enough to bootstrap full capacity when disaster strikes. Think of it as a pilot flame: always burning at low cost, ready to ignite the full furnace.
Always Running in DR Region
- RDS read replica of production database (promotes to primary in 5–15 minutes)
- Route 53 failover records pointing to DR ALB (dormant, health check failing)
- VPC, subnets, security groups, IAM roles — network foundation in place
- ECR container images replicated or accessible from DR region
Stopped/Minimal in DR Region
- EC2 Auto Scaling groups: desired capacity = 0
- ECS services: desired tasks = 0
- ElastiCache: not running (restore from backup on DR activation)
Failover Procedure (SSM Automation)
```bash
# AWS CLI: Pilot Light DR failover steps

# Step 1: Promote the RDS read replica in the DR region (~5-15 min)
aws rds promote-read-replica \
  --db-instance-identifier prod-db-dr-replica \
  --region us-west-2

# Step 2: Scale up ECS services
aws ecs update-service \
  --cluster prod-cluster-dr \
  --service api-service \
  --desired-count 3 \
  --region us-west-2

# Step 3: Route 53 fails DNS over to the DR endpoint automatically
# once the primary's health check starts failing — no action needed here

# Step 4: Validate application health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/prod-tg-dr/abc123 \
  --region us-west-2
```
Total RTO breakdown: promote RDS (5–15 min) + ECS scale-out (5–10 min) + application warmup (2–5 min) = 12–30 minutes. Infrastructure as Code is non-negotiable for Pilot Light — the entire DR environment must be deployable with a single Terraform command.
6. Warm Standby: Always-On Reduced Capacity
Warm Standby runs the DR environment continuously at reduced capacity (smaller instance types, fewer tasks). Unlike Pilot Light, there's no cold start delay — the DR environment is always "warm" and can handle traffic immediately.
- Production: 10 × r6g.xlarge ECS tasks in us-east-1
- DR (normal): 2 × r6g.large ECS tasks in us-west-2 (scaled down, validating config works)
- DR (activated): Scale up to 10 × r6g.xlarge ECS tasks in us-west-2
Failover Timeline
Route 53 health check polls every 10 seconds with failure threshold 3 = 30 seconds to detect failure. DNS TTL: 60 seconds. ECS scale-out in DR: 2–3 minutes. Estimated total RTO: 3–5 minutes.
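The detection arithmetic above generalizes to any health-check configuration — a small sketch (function name is our own) for estimating the worst-case window before traffic shifts:

```python
# Worst-case time before clients reach the DR region after a primary failure:
# the health check must fail `failure_threshold` consecutive probes, then
# clients may keep cached DNS answers for up to one TTL.
def failover_detection_seconds(probe_interval_s: int,
                               failure_threshold: int,
                               dns_ttl_s: int) -> int:
    return probe_interval_s * failure_threshold + dns_ttl_s
```

With 10-second fast health checks, a threshold of 3, and a 60-second TTL, the window is 90 seconds — the remainder of the 3–5 minute RTO estimate is ECS scale-out and application warmup.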
Regularly route 5–10% of production traffic to the DR region as part of Warm Standby operation. This validates that DR actually handles production workloads correctly — not just that the infrastructure is running, but that it performs under load.
- Database: Aurora Global Database provides sub-second replication to DR region. On failover, promote the secondary to writer in <1 minute.
- Sessions: Stateless application + ElastiCache Global Datastore = no session loss on failover. Users stay logged in.
- Cost: DR region typically 20–30% of production cost (running at reduced capacity continuously).
7. Multi-Site Active-Active: Zero Downtime
Multi-Site Active-Active runs identical production deployments in multiple AWS regions simultaneously, serving traffic via Route 53 Latency-Based routing. When a region fails, Route 53 health checks stop routing traffic to it — within 30–60 seconds — with no manual intervention.
The Data Consistency Challenge
Multi-region active-active is architecturally complex because writes happening in multiple regions simultaneously can conflict. Strategies:
- DynamoDB Global Tables: Multi-region active-active with last-writer-wins conflict resolution. Sub-second replication. RPO ~0. Best choice for session data, user profiles, catalog.
- Aurora Global Database: One write region + up to 5 read regions. RPO ~1 second. Failover promotes a reader to writer in <1 minute. Best for relational data needing ACID guarantees.
- Region affinity: Route users to their "home" region via geolocation/cookie. All writes for a user stay in one region. Simplest consistency model, but doesn't eliminate cross-region reads.
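Last-writer-wins is easy to state but worth internalizing: the concurrent write from the other region is silently discarded. A toy sketch of the semantics (item shape and timestamps are hypothetical, not the DynamoDB wire format):

```python
# Illustration of last-writer-wins conflict resolution — the model DynamoDB
# Global Tables applies when the same item is written in two regions.
def resolve_last_writer_wins(item_a: dict, item_b: dict) -> dict:
    # The version with the later modification timestamp survives;
    # the other region's concurrent write is lost without error.
    if item_a["last_modified"] >= item_b["last_modified"]:
        return item_a
    return item_b
```

This is why LWW suits session and profile data (losing one concurrent update is tolerable) but not financial ledgers, where a discarded write is a correctness bug.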
```hcl
# Terraform: DynamoDB Global Table (multi-region active-active)
resource "aws_dynamodb_table" "sessions" {
  name             = "user-sessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "userId"
  stream_enabled   = true                 # streams are required for replicas
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "userId"
    type = "S"
  }

  replica {
    region_name = "us-east-1"
  }

  replica {
    region_name = "eu-west-1"
  }

  replica {
    region_name = "ap-southeast-1"
  }
}
```
8. Data Replication: S3, RDS, DynamoDB, EBS
S3 Cross-Region Replication (CRR)
Automatic replication of new objects to a DR region bucket. Enable versioning on both source and destination buckets. Use Replication Time Control (RTC) for SLA-backed replication: 99.99% of objects replicated within 15 minutes.
RDS Cross-Region Read Replica
Asynchronous replication to DR region. Typical replication lag <1 second, can spike under heavy write load. Promote to standalone primary on DR: aws rds promote-read-replica. After promotion, update the application's database connection string (Secrets Manager reference).
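After promotion, the application must pick up the new endpoint. A sketch of rewriting the Secrets Manager payload (the `host` field name and secret shape are assumptions about your secret format; the updated string would be pushed back with `put_secret_value` as noted in the comment):

```python
import json

# Sketch: repoint the DB secret at the promoted DR replica so the app
# reconnects to the new primary. In practice, push the result back with:
#   boto3.client("secretsmanager").put_secret_value(
#       SecretId="prod/db", SecretString=new_secret)
def point_secret_at_dr(secret_json: str, dr_endpoint: str) -> str:
    secret = json.loads(secret_json)
    secret["host"] = dr_endpoint  # assumed field name in the secret payload
    return json.dumps(secret)
```

Applications that cache the secret need a connection-pool recycle (or a restart) to pick up the change — budget that into the RTO.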
EBS Snapshot Cross-Region Copy
For EC2 instances, copy AMIs and EBS snapshots to the DR region via AWS Backup or manually. Launch EC2 instances from the copied AMI in the DR region during activation. Snapshots are incremental — only changed blocks are copied after the initial snapshot.
```hcl
# Terraform: S3 Cross-Region Replication
resource "aws_s3_bucket_replication_configuration" "crr" {
  role   = aws_iam_role.s3_replication.arn
  bucket = aws_s3_bucket.primary.id

  rule {
    id     = "full-replication"
    status = "Enabled"

    # Required alongside replication_time (V2 rule schema)
    filter {}
    delete_marker_replication {
      status = "Enabled"
    }

    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"

      replication_time {
        status = "Enabled"
        time {
          minutes = 15 # RTC - SLA-backed
        }
      }

      metrics {
        status = "Enabled"
        event_threshold {
          minutes = 15
        }
      }
    }
  }
}
```
```hcl
# RDS: Create a cross-region read replica
resource "aws_db_instance" "dr_replica" {
  provider            = aws.us_west_2
  identifier          = "prod-db-dr"
  replicate_source_db = "arn:aws:rds:us-east-1:123456789012:db:prod-db"
  instance_class      = "db.r7g.large"
  publicly_accessible = false
  skip_final_snapshot = false
  deletion_protection = true

  tags = {
    Role        = "dr-replica"
    Environment = "dr"
  }
}
```
9. DR Testing: Game Days & Automated Runbooks
DR plans fail in production for one reason: they were never tested. Stale runbooks, dependency drift, team turnover, and changed application behavior all silently invalidate a DR plan over time.
Testing Cadence
- Monthly: Component-level tests — restore a single RDS snapshot, validate ElastiCache failover, test Lambda@Edge edge cases.
- Quarterly: Full DR drill — execute the complete failover procedure in a staging environment. Measure actual RTO vs target. Document gaps.
- Twice yearly: Game Day — full team exercise with production-like conditions. Rotate on-call, announce scope, execute failover, debrief.
AWS Fault Injection Simulator (FIS)
FIS injects controlled failures — AZ outages, EC2 termination, latency injection, API throttling — without manual scripting. Define experiments as code, set blast radius limits, and run in production to validate DR readiness under real traffic.
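An FIS experiment defined as code might look like the following Terraform sketch (field names follow the `aws_fis_experiment_template` resource; the role, alarm, and tag values are illustrative assumptions) — it terminates a single production-tagged instance, with a CloudWatch alarm as the blast-radius guardrail:

```hcl
# Sketch: FIS experiment — terminate one tagged instance to rehearse failure
resource "aws_fis_experiment_template" "terminate_instance" {
  description = "DR drill: terminate one production-tagged EC2 instance"
  role_arn    = aws_iam_role.fis.arn # assumed IAM role for FIS

  # Guardrail: stop the experiment if the error-rate alarm fires
  stop_condition {
    source = "aws:cloudwatch:alarm"
    value  = aws_cloudwatch_metric_alarm.error_rate.arn
  }

  action {
    name      = "terminate-one"
    action_id = "aws:ec2:terminate-instances"
    target {
      key   = "Instances"
      value = "tagged-instances"
    }
  }

  target {
    name           = "tagged-instances"
    resource_type  = "aws:ec2:instance"
    selection_mode = "COUNT(1)" # blast radius: exactly one instance

    resource_tag {
      key   = "Environment"
      value = "production"
    }
  }
}
```

The `COUNT(1)` selection mode plus the stop condition are what make this safe enough to run against production, which is the whole point of FIS.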
```yaml
# AWS Systems Manager Automation: Pilot Light DR runbook
# Stored in SSM, versioned in Git, requires approval before execution
---
description: "Activate Pilot Light DR - Failover to us-west-2"
schemaVersion: "0.3"
assumeRole: "{{ AutomationAssumeRole }}"
parameters:
  AutomationAssumeRole:
    type: String
  PrimaryHealthCheckId:
    type: String
  NotificationArn:
    type: String
mainSteps:
  # Sanity-check the primary health check config before acting; a production
  # runbook should also confirm the health check is actually failing.
  - name: VerifyPrimaryFailure
    action: aws:assertAwsResourceProperty
    inputs:
      Service: route53
      Api: GetHealthCheck
      HealthCheckId: "{{ PrimaryHealthCheckId }}"
      PropertySelector: "$.HealthCheck.HealthCheckConfig.FailureThreshold"
      DesiredValues: ["3"]
  - name: PromoteRDSReplica
    action: aws:executeAwsApi
    inputs:
      Service: rds
      Api: PromoteReadReplica
      DBInstanceIdentifier: "prod-db-dr-replica"
  - name: WaitForRDSPromotion
    action: aws:waitForAwsResourceProperty
    timeoutSeconds: 900
    inputs:
      Service: rds
      Api: DescribeDBInstances
      DBInstanceIdentifier: "prod-db-dr-replica"
      PropertySelector: "$.DBInstances[0].DBInstanceStatus"
      DesiredValues: ["available"]
  - name: ScaleECSServices
    action: aws:executeAwsApi
    inputs:
      Service: ecs
      Api: UpdateService
      cluster: "prod-cluster-dr"
      service: "api-service"
      desiredCount: 3
  - name: SendNotification
    action: aws:executeAwsApi
    inputs:
      Service: sns
      Api: Publish
      TopicArn: "{{ NotificationArn }}"
      Message: "DR Activation complete. Primary: us-east-1 (FAILED). Active: us-west-2."
```
10. DR Cost Optimization & Checklist
Cost Optimization Strategies
- Move backups to S3 Glacier Instant Retrieval after 30 days (90% cost reduction vs S3 Standard)
- Auto Scaling desired capacity = 0 for compute in Pilot Light DR region
- Reserved Instances (1-year) for Warm Standby instances (saves 30–40% vs On-Demand)
- S3 Intelligent-Tiering for objects with unpredictable access patterns during DR periods
- Stop non-critical DR resources (ElastiCache, OpenSearch) until DR activation
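The lifecycle savings are easy to estimate — a back-of-envelope sketch using the Glacier prices quoted earlier (the warm-storage rate is an assumption for illustration; actual AWS Backup warm pricing varies by service):

```python
# Rough monthly backup storage cost in USD, using $/GB-month rates.
# Glacier rates are from the article; the warm rate is an assumption.
PRICES_PER_GB_MONTH = {
    "warm": 0.05,              # assumption: warm backup storage
    "glacier_instant": 0.004,  # after 30 days
    "deep_archive": 0.00099,   # after 90 days
}

def monthly_backup_cost(gb_by_tier: dict) -> float:
    return round(sum(PRICES_PER_GB_MONTH[tier] * gb
                     for tier, gb in gb_by_tier.items()), 2)
```

For example, keeping 10 TB in Deep Archive instead of warm storage drops that slice of the bill from roughly $500 to under $10 per month — which is why aggressive lifecycle rules are the first DR cost lever to pull.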
Planning & Architecture
- ✅ RTO/RPO targets defined and signed off by business stakeholders
- ✅ Application services tiered (Tier 1–4) with DR strategy per tier
- ✅ DR region selected (ideally diagonally opposite: us-east-1 → us-west-2 or eu-west-1)
- ✅ AWS Organizations structure: dedicated backup account for cross-account backup
- ✅ Compliance requirements documented (PCI-DSS annual DR test, HIPAA contingency plan)
Data Backup
- ✅ AWS Backup plan: daily, 30-day retention, cross-region copy, cross-account copy
- ✅ Backup Vault Lock enabled for ransomware protection (WORM)
- ✅ S3 CRR enabled with RTC for critical buckets
- ✅ RDS read replica in DR region (for Pilot Light and Warm Standby)
- ✅ DynamoDB Global Tables for session/profile data
- ✅ EC2 AMIs copied to DR region for key instances
Failover Procedures
- ✅ SSM Automation runbook: failover procedure as code, versioned in Git
- ✅ Route 53 health checks configured; failover records in place
- ✅ DNS TTL on failover records: 60 seconds
- ✅ Secrets Manager secrets replicated to DR region
- ✅ Container images in ECR replicated or accessible cross-region
Testing & Documentation
- ✅ Restore test quarterly: actually restore RDS snapshot to DR region and validate
- ✅ Full DR drill quarterly: execute failover in staging, measure actual RTO/RPO
- ✅ Game day twice yearly: full team, production-like conditions
- ✅ AWS FIS experiments: AZ failure, instance termination, latency injection
- ✅ Runbook reviewed and updated after every DR test
- ✅ AWS Resilience Hub: resiliency score tracked per application
Common DR Anti-Patterns
- No documented runbooks — "we'll figure it out when it happens" always fails
- Single-region database with no read replica — a regional outage means complete data inaccessibility
- Backups in the same AWS account as production — ransomware deletes both simultaneously
- Untested backups — a backup that has never been restored is not a backup
- Manual failover steps — human error under stress adds 30–60+ minutes to RTO
- Using the same Terraform workspace for production and DR — changes can accidentally modify both environments