Pablo Rodriguez

Disaster Recovery Patterns

Four common disaster recovery patterns provide different combinations of RPO, RTO, and cost-effectiveness:

Backup and Restore

Solutions requiring RTO and RPO in hours - Lower-priority use cases

Pilot Light

Solutions requiring RTO and RPO in tens of minutes - Core services

Warm Standby

Solutions requiring RTO and RPO in minutes - Business-critical services

Multi-Site

Solutions requiring RTO and RPO in near real-time - Automatic failover

Each pattern is well-suited to different requirements. Patterns that achieve lower RPO and RTO cost more to maintain.
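The trade-off above can be sketched as a small lookup that picks the cheapest pattern able to meet a recovery target. This is only an illustration: the minute thresholds and the 1-4 cost ranks are assumptions chosen to mirror the "hours / tens of minutes / minutes / near real-time" tiers, not figures from AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    min_recovery_minutes: float  # assumed fastest recovery the pattern realistically achieves
    relative_cost: int           # assumed rank: 1 = cheapest, 4 = most expensive


# Illustrative tiers only; real thresholds depend on the workload.
PATTERNS = [
    Pattern("Backup and Restore", 60, 1),  # hours
    Pattern("Pilot Light", 10, 2),         # tens of minutes
    Pattern("Warm Standby", 1, 3),         # minutes
    Pattern("Multi-Site", 0, 4),           # near real-time
]


def cheapest_pattern(rto_minutes: float, rpo_minutes: float) -> Pattern:
    """Return the lowest-cost pattern whose recovery floor fits the stricter target."""
    if rto_minutes < 0 or rpo_minutes < 0:
        raise ValueError("RTO and RPO must be non-negative")
    target = min(rto_minutes, rpo_minutes)
    candidates = [p for p in PATTERNS if p.min_recovery_minutes <= target]
    return min(candidates, key=lambda p: p.relative_cost)


print(cheapest_pattern(rto_minutes=30, rpo_minutes=45).name)
```

Using the stricter of the two objectives keeps the sketch simple; a real assessment would usually weigh RTO and RPO separately.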

Backup and restore is a suitable approach for mitigating data loss or corruption. It can also mitigate a regional disaster by replicating data to other AWS Regions, or compensate for a lack of redundancy in workloads deployed to a single Availability Zone.

  • Can take long time to restore system when disaster occurs
  • Amazon S3 provides conveniently accessible destination for backup data needed quickly for restore
  • Data transfer to/from Amazon S3 typically done through network, accessible from any location
  • Data copied from S3 bucket every 30 days
  • DataSync or Amazon S3 Transfer Acceleration can automate or increase speed of data transfer
  • S3 lifecycle configuration moves backup data to less-expensive storage classes once objects are 90 days old
  • Backup data moves to Amazon S3 Glacier Flexible Retrieval or S3 Standard-IA
  • Download backup data from Amazon S3 back to on-premises servers
  • If corporate data center remains offline, Amazon EC2 servers in VPC can connect to S3 bucket containing backup application data
  • Can temporarily host applications while working to restore data center
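As a rough sketch of the lifecycle bullet above, the function below builds a rule in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`. The `backups/` prefix and the rule ID format are hypothetical, and no AWS call is made here. `GLACIER` is the API value for S3 Glacier Flexible Retrieval.

```python
def backup_lifecycle_rule(prefix: str, days: int, storage_class: str) -> dict:
    """Build a lifecycle configuration that transitions objects under `prefix`
    to `storage_class` once they are `days` old."""
    allowed = {"STANDARD_IA", "GLACIER"}  # the two classes named in the notes
    if storage_class not in allowed:
        raise ValueError(f"storage_class must be one of {sorted(allowed)}")
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}-after-{days}d",  # hypothetical naming
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


config = backup_lifecycle_rule("backups/", 90, "GLACIER")
print(config["Rules"][0]["ID"])
```

The resulting dictionary could be passed to an S3 client together with a bucket name; that step is left out so the sketch runs without credentials.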

AWS Storage Gateway for Backup and Restore

AWS Storage Gateway is a hybrid storage service that enables on-premises applications to use AWS Cloud storage for backup, archiving, disaster recovery, cloud data processing, storage tiering, and migration.

With Amazon S3 File Gateway, store and retrieve objects in Amazon S3 using the NFS or SMB protocol. A local cache provides low-latency access to the most recently used data.

Preparation Phase:

  • Create backups of current systems
  • Store backups in Amazon S3
  • Document procedure to restore from backups
  • Know which AMI to use, how to restore system, route traffic, and configure deployment

In Case of Disaster:

  • Retrieve backups from Amazon S3
  • Start required infrastructure (EC2 instances, VPC, subnets, security groups)
  • Restore system from backups
  • Route traffic to new system (adjust DNS records)

Use AWS CloudFormation stacks to restore infrastructure consistently across Regions. Use AMIs to create EC2 instances with required operating systems and packages.
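The final restore step, routing traffic to the new system, can be sketched as a Route 53 change batch. The dictionary below follows the `ChangeBatch` shape used by boto3's `change_resource_record_sets`; the record name and IP address in the usage lines are placeholders, and nothing is sent to AWS.

```python
import ipaddress


def dns_cutover_change_batch(record_name: str, new_ip: str, ttl: int = 60) -> dict:
    """Build a change batch that points an A record at the restored system."""
    ipaddress.IPv4Address(new_ip)  # raises ValueError if new_ip is not valid IPv4
    return {
        "Comment": "DR cutover to restored environment",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite the existing one
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,  # a short TTL lets clients pick up the change sooner
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }
        ],
    }


batch = dns_cutover_change_batch("app.example.com.", "203.0.113.10")
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```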

The pilot light pattern keeps a minimal backup version of the environment always running. The analogy comes from a gas heater: a small flame that is always on can quickly ignite the entire furnace.

  • Secondary database always running as “pilot light”
  • Main drawback: secondary cannot handle entire load of primary and needs to scale quickly
  • Recovery time typically faster than backup-and-restore because core pieces already running and kept up to date
  • Can rapidly provision full production environment around critical core
  • Relatively inexpensive to implement
  • Database servers as critical core (the pilot light)
  • All other infrastructure pieces quickly provisioned around pilot light
  • Preconfigured servers bundled as AMIs ready to start at moment’s notice
  • Instances might be in stopped state for quick startup

When disaster strikes and primary application goes offline:

  • Quickly commission compute resources to run application
  • Orchestrate failover to pilot light resources in AWS
  • New web server and app server start up and connect to secondary database
  • Route 53 configured to route traffic to new web server

Preparation Phase:

  • Configure EC2 instances to replicate servers
  • Create and maintain AMIs of key servers where fast recovery needed
  • Regularly run, test, and update servers
  • Consider automating provisioning of AWS resources

In Case of Disaster:

  • Automatically bring up resources around replicated core dataset
  • Scale system as needed to handle current production traffic
  • Switch over to new system by adjusting DNS records to point to backup deployment
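The "scale system as needed" step comes down to a capacity estimate. A minimal sketch, assuming load is measured in requests per second and that per-instance capacity is known from load testing; the function name, the 20% default headroom, and the sample numbers are all hypothetical.

```python
def instances_needed(requests_per_second: int,
                     capacity_per_instance: int,
                     headroom_percent: int = 20) -> int:
    """Number of instances required to serve the load plus a safety margin.

    Integer arithmetic is used so the ceiling is exact.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if requests_per_second < 0 or headroom_percent < 0:
        raise ValueError("load and headroom must be non-negative")
    numerator = requests_per_second * (100 + headroom_percent)
    denominator = 100 * capacity_per_instance
    return -(-numerator // denominator)  # ceiling division


# Pilot light: no application servers are running before the disaster.
print(instances_needed(requests_per_second=1000, capacity_per_instance=150))
```

In practice this number would feed the desired capacity of an Auto Scaling group rather than being launched by hand.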

Warm standby is like pilot light, but with more resources already running: a scaled-down version of a fully functional environment is always running in the cloud.

  • Extends pilot light elements and preparation
  • Decreases recovery time because some services always running
  • Fully duplicate business-critical systems and have them always on
  • Servers running on minimum-sized fleet of EC2 instances with smallest possible sizes
  • Not scaled to take full production load, but fully functional
  • Can be used for non-production work like testing, quality assurance, internal use
  • Route 53 distributes requests between main system and backup system
  • In disaster, Route 53 switches over to secondary system if primary environment unavailable
  • Secondary system quickly scales up to handle production load by adding more EC2 instances or resizing to larger instance types
  • Horizontal scaling preferred over vertical scaling

Preparation Phase:

  • Similar to pilot light pattern
  • All necessary components running 24/7, but not scaled for production traffic
  • Conduct continuous testing on components
  • Trickle statistical subset of production traffic to DR site to verify seamless function

In Case of Disaster:

  • Immediately fail over most critical production load
  • Adjust DNS records to point to AWS
  • Route 53 handles failover automatically with health checks
  • Automatically scale system further to handle all production load
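Automatic failover with health checks can be sketched as a pair of Route 53 failover records. The structure follows the record-set fields `SetIdentifier`, `Failover`, and `HealthCheckId` used by `change_resource_record_sets`; the domain, IP addresses, and health check ID in the usage lines are placeholders, and no request is made.

```python
def failover_record_sets(name: str, primary_ip: str, secondary_ip: str,
                         primary_health_check_id: str, ttl: int = 60) -> dict:
    """Build a change batch with PRIMARY and SECONDARY failover A records."""

    def change(role, ip, health_check_id=None):
        record = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-site",  # must be unique per record
            "Failover": role,                          # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id is not None:
            # Route 53 answers with the secondary while this check is unhealthy.
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    return {
        "Changes": [
            change("PRIMARY", primary_ip, primary_health_check_id),
            change("SECONDARY", secondary_ip),
        ]
    }


batch = failover_record_sets("app.example.com.", "198.51.100.5",
                             "203.0.113.10", "example-health-check-id")
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```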

In the multi-site pattern, a fully functional system runs in a second Region simultaneously with on-premises systems or systems in a different Region.

  • Runs in active-active configuration
  • Data replication method determined by chosen recovery point
  • Both sites support full production capacity
  • Can use DNS service supporting weighted routing (like Route 53)
  • Proportion of traffic goes to AWS infrastructure, remainder to on-site infrastructure
  • Adjust DNS weighting to send all traffic to cloud environment
  • Capacity of cloud deployment rapidly increased to handle full production load
  • Use Amazon EC2 Auto Scaling to automate scaling process
  • May need application logic to detect primary database failure and cut over to parallel running database services
  • Cost determined by volume of production traffic during normal operation
  • In recovery phase, pay only for what you use for duration DR environment needed at full scale
  • Can reduce cost by purchasing Amazon EC2 Reserved Instances for always-on AWS servers
  • Geolocation routing: Configure which Region request goes to based on origin location
  • Latency routing: AWS automatically sends requests to Region providing shortest round-trip time

Data governance strategy helps inform which routing policy to use. Geolocation provides deterministic distribution and can keep user data within specific Region.
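With weighted routing, each record receives a share of traffic equal to its weight divided by the sum of all weights. The sketch below computes those shares for a normal-operation split and for a disaster split; the site names and weights are made up, and rejecting an all-zero total is a simplification chosen for this example.

```python
from fractions import Fraction


def traffic_shares(weights: dict) -> dict:
    """Return each site's share of traffic as an exact fraction of the total."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {site: Fraction(w, total) for site, w in weights.items()}


normal = traffic_shares({"aws": 10, "on_premises": 90})
disaster = traffic_shares({"aws": 100, "on_premises": 0})
print(normal["aws"], disaster["aws"])
```

Setting the on-site weight to zero is the "adjust DNS weighting" step from the list above: all traffic then resolves to the cloud environment.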

Preparation Phase:

  • Similar to warm standby pattern
  • Configure backup deployment for full scaling in and out of production load
  • Have servers running and ready to receive traffic
  • Consider licensing cost of complete duplicate system

In Case of Disaster:

  • Immediately fail over all production load to backup site
  • Potentially least downtime of all patterns
  • Higher costs because entire duplicate system must be created at secondary site

Cost vs RTO Relationship

Each pattern offers different combination of benefits:

  • Backup and Restore: Lowest cost, longest RTO - systems restored more slowly
  • Pilot Light: Moderate cost, moderate RTO - recovery in tens of minutes
  • Warm Standby: Higher cost, faster RTO - recovery in minutes for business-critical services
  • Multi-Site: Highest cost, fastest RTO - near real-time recovery with automatic failover

With AWS, you can cost-effectively operate each DR strategy. These patterns are examples of possible approaches - variations and combinations are possible.

It is a best practice to regularly exercise the DR solution to ensure that it works as intended:

  • Ensure backups, snapshots, and AMIs are being created and can successfully restore data
  • Monitor your monitoring system
  • Establish RTO and RPO, work to improve them where possible
  • Test response procedures to ensure effectiveness
  • Ensure teams are familiar with implementation procedures
  • Set up regular Game Days to test workload and team responses to simulated events

Game Day exercises test scenarios in which critical systems go offline or even entire Regions fail.
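One way to score a Game Day is to compare what the drill achieved against the stated objectives. A minimal sketch, assuming the team records three timestamps during the exercise; the function and field names are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(outage_start: datetime, service_restored: datetime,
                   last_good_backup: datetime,
                   rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured recovery time and data-loss window with the targets."""
    achieved_rto = service_restored - outage_start   # how long the service was down
    achieved_rpo = outage_start - last_good_backup   # how much recent data was lost
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


start = datetime(2024, 1, 1, 12, 0)
result = evaluate_drill(
    outage_start=start,
    service_restored=start + timedelta(minutes=25),
    last_good_backup=start - timedelta(hours=3),
    rto=timedelta(minutes=30),
    rpo=timedelta(hours=1),
)
print(result["rto_met"], result["rpo_met"])
```

A missed objective in a drill points either to the procedure (restore steps too slow) or to the preparation (backups too infrequent), which is the feedback these exercises are meant to produce.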

Disaster recovery patterns provide structured approaches to balancing recovery objectives with cost considerations. The choice of pattern depends on business requirements, acceptable downtime, data loss tolerance, and budget constraints.