Disaster Recovery Patterns

Common Disaster Recovery Patterns on AWS

Four common disaster recovery patterns provide different combinations of RPO, RTO, and cost-effectiveness:

Backup and Restore

Solutions requiring RTO and RPO in hours - Lower-priority use cases

Pilot Light

Solutions requiring RTO and RPO in tens of minutes - Core services

Warm Standby

Solutions requiring RTO and RPO in minutes - Business-critical services

Multi-Site

Solutions requiring RTO and RPO in near real-time - Automatic failover

Each pattern is well-suited to different requirements. Patterns with shorter RPO and RTO cost more to maintain.
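
The trade-off above can be sketched as a lookup that picks the cheapest pattern meeting a required RTO. This is only an illustration: the minute values are assumptions standing in for "hours", "tens of minutes", "minutes", and "near real-time", and the names Pattern, PATTERNS, and cheapest_pattern are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    typical_rto_minutes: float  # assumed figure, for illustration only
    relative_cost: int          # 1 = cheapest, 4 = most expensive


# Ordered cheapest first. The RTO numbers are placeholders, not AWS guidance.
PATTERNS = [
    Pattern("Backup and Restore", 24 * 60, 1),
    Pattern("Pilot Light", 60, 2),
    Pattern("Warm Standby", 10, 3),
    Pattern("Multi-Site", 1, 4),
]


def cheapest_pattern(required_rto_minutes: float) -> Pattern:
    """Return the lowest-cost pattern whose assumed RTO meets the requirement."""
    for pattern in PATTERNS:
        if pattern.typical_rto_minutes <= required_rto_minutes:
            return pattern
    # Nothing is fast enough on paper: fall back to the fastest pattern.
    return PATTERNS[-1]


print(cheapest_pattern(90).name)
```

A real selection would also weigh RPO, licensing, and budget, which this sketch leaves out.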

Backup and Restore Pattern

Suitable approach for mitigating data loss or corruption. Can also mitigate regional disaster by replicating data to other AWS Regions, and can compensate for lack of redundancy in workloads deployed to single Availability Zone.

Key Characteristics

Can take long time to restore system when disaster occurs
Amazon S3 provides conveniently accessible destination for backup data needed quickly for restore
Data transfer to/from Amazon S3 typically done through network, accessible from any location

Implementation Example

Data copied to S3 bucket every 30 days
DataSync or Amazon S3 Transfer Acceleration can automate or increase speed of data transfer
S3 lifecycle configuration moves backup data to less-expensive storage classes after 90 days
Backup data moves to Amazon S3 Glacier Flexible Retrieval or S3 Standard-IA
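
A lifecycle rule of this kind might look like the sketch below. It assumes objects transition once they are 90 days old (one reading of the 90-day figure above) and that backups live under a hypothetical "backups/" prefix. The dictionary follows the shape accepted by the S3 PutBucketLifecycleConfiguration API, where "GLACIER" is the storage class value for S3 Glacier Flexible Retrieval and "STANDARD_IA" is the value for S3 Standard-IA. The helper name is made up for illustration.

```python
import json


def backup_lifecycle_rule(prefix: str, days: int, storage_class: str) -> dict:
    """Build a lifecycle configuration that transitions objects under a prefix."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


# Assumed values: "backups/" prefix and a 90-day threshold.
config = backup_lifecycle_rule("backups/", 90, "GLACIER")
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration; that call is deliberately left out so the sketch runs offline.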

Restore Process

Download backup data from Amazon S3 back to on-premises servers
If corporate data center remains offline, Amazon EC2 servers in VPC can connect to S3 bucket containing backup application data
Can temporarily host applications while working to restore data center

AWS Storage Gateway for Backup and Restore

AWS Storage Gateway is hybrid storage service enabling on-premises applications to use AWS Cloud storage for backup, archiving, disaster recovery, cloud data processing, storage tiering, and migration.

Three Gateway Interfaces

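File Gateway: Store and retrieve objects in Amazon S3 using NFS or SMB protocol. Local cache provides low-latency access to most recently used data.
Volume Gateway: Present cloud-backed block storage volumes to on-premises applications over iSCSI.
Tape Gateway: Present virtual tape library (VTL) to existing backup software over iSCSI, with virtual tapes stored in AWS.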

Implementation Steps

Preparation Phase:

Create backups of current systems
Store backups in Amazon S3
Document procedure to restore from backups
Know which AMI to use, how to restore system, route traffic, and configure deployment

In Case of Disaster:

Retrieve backups from Amazon S3
Start required infrastructure (EC2 instances, VPC, subnets, security groups)
Restore system from backups
Route traffic to new system (adjust DNS records)

Use AWS CloudFormation stacks to restore infrastructure consistently across Regions. Use AMIs to create EC2 instances with required operating systems and packages.
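
As a rough illustration of the CloudFormation approach, the sketch below generates a minimal template with one EC2 instance. AMI IDs differ between Regions, so the AMI is taken as a stack parameter and the same template can be reused in each Region. The logical names AppServer and AmiId, the default instance type, and the helper function are assumptions for this example, not part of any documented procedure.

```python
import json


def minimal_restore_template(instance_type: str = "t3.micro") -> str:
    """Return a CloudFormation template (JSON) that launches one server from an AMI."""
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative DR restore stack (sketch only)",
        "Parameters": {
            # Supplied per Region at stack creation time.
            "AmiId": {"Type": "AWS::EC2::Image::Id"},
        },
        "Resources": {
            "AppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "AmiId"},
                    "InstanceType": instance_type,
                },
            }
        },
    }
    return json.dumps(template, indent=2)


print(minimal_restore_template())
```

A production template would also define the VPC, subnets, and security groups listed in the disaster steps above.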

Pilot Light Pattern

Describes disaster recovery pattern where minimal backup version of environment is always running. The pilot light analogy comes from gas heater: small flame always on can quickly ignite entire furnace.

Key Characteristics

Secondary database always running as “pilot light”
Main drawback: secondary cannot handle entire load of primary and needs to scale quickly
Recovery time typically faster than backup-and-restore because core pieces already running and kept up to date
Can rapidly provision full production environment around critical core
Relatively inexpensive to implement

Infrastructure Elements

Database servers as critical core (the pilot light)
All other infrastructure pieces quickly provisioned around pilot light
Preconfigured servers bundled as AMIs ready to start at moment’s notice
Instances might be in stopped state for quick startup

Failover Process

When disaster strikes and primary application goes offline:

Quickly commission compute resources to run application
Orchestrate failover to pilot light resources in AWS
New web server and app server start up and connect to secondary database
Route 53 configured to route traffic to new web server

Implementation Steps

Preparation Phase:

Configure EC2 instances to replicate servers
Create and maintain AMIs of key servers where fast recovery needed
Regularly run, test, and update servers
Consider automating provisioning of AWS resources

In Case of Disaster:

Automatically bring up resources around replicated core dataset
Scale system as needed to handle current production traffic
Switch over to new system by adjusting DNS records to point to backup deployment
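
The DNS switch in the last step could be expressed as a Route 53 change batch. The sketch below only builds the request body in the shape used by the ChangeResourceRecordSets API. The record name and IP address are placeholders from documentation ranges, and the helper name is hypothetical.

```python
def dns_switch_change_batch(record_name: str, new_ip: str, ttl: int = 60) -> dict:
    """Build a change batch that points an A record at the backup deployment."""
    return {
        "Comment": "DR failover: point record at backup deployment",
        "Changes": [
            {
                # UPSERT creates the record or overwrites the existing one.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }
        ],
    }


# Placeholder values; a real switch would use the hosted zone's own record.
batch = dns_switch_change_batch("app.example.com.", "203.0.113.10")
print(batch["Changes"][0]["ResourceRecordSet"])
```

A short TTL is assumed here so that clients pick up the new address quickly. The right value depends on the workload.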

Warm Standby Pattern

Like pilot light, but more resources already running. Describes disaster recovery scenario where scaled-down version of fully functional environment is always running in cloud.

Key Characteristics

Extends pilot light elements and preparation
Decreases recovery time because some services always running
Fully duplicate business-critical systems and have them always on
Servers running on minimum-sized fleet of EC2 instances with smallest possible sizes
Not scaled to take full production load, but fully functional
Can be used for non-production work like testing, quality assurance, internal use

Software Licensing Consideration

Failover Process

Route 53 distributes requests between main system and backup system
In disaster, Route 53 switches over to secondary system if primary environment unavailable
Secondary system quickly scales up to handle production load by adding more EC2 instances or resizing to larger instance types
Horizontal scaling preferred over vertical scaling
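
The horizontal scale-up can be estimated with simple arithmetic. The sketch below assumes load is measured in requests per second, that each instance handles a known share, and that some headroom is wanted on top. All three figures and both function names are illustrative assumptions.

```python
import math


def instances_needed(production_rps: float, rps_per_instance: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve production load plus a safety margin."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(production_rps * (1 + headroom) / rps_per_instance)


def instances_to_add(current: int, production_rps: float,
                     rps_per_instance: float, headroom: float = 0.25) -> int:
    """How many instances the standby fleet must add during failover."""
    needed = instances_needed(production_rps, rps_per_instance, headroom)
    return max(0, needed - current)


# Example: standby runs 2 instances; production needs 1000 rps at 125 rps each.
print(instances_needed(1000, 125))     # 10
print(instances_to_add(2, 1000, 125))  # 8
```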

Implementation Steps

Preparation Phase:

Similar to pilot light pattern
All necessary components running 24/7, but not scaled for production traffic
Conduct continuous testing on components
Trickle statistical subset of production traffic to DR site to verify seamless function

In Case of Disaster:

Immediately fail over most critical production load
Adjust DNS records to point to AWS
Route 53 handles failover automatically with health checks
Automatically scale system further to handle all production load
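
Automatic failover with health checks can be modeled as a pair of Route 53 failover records: a PRIMARY record tied to a health check and a SECONDARY record that answers when the primary is unhealthy. The sketch below only builds the record sets. The domain, IP addresses, and health check ID are placeholders, and the helper name is hypothetical.

```python
def failover_record_sets(name: str, primary_ip: str, secondary_ip: str,
                         primary_health_check_id: str, ttl: int = 60) -> list:
    """Build PRIMARY and SECONDARY failover records for one DNS name."""

    def record(role: str, ip: str) -> dict:
        return {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-site",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }

    primary = record("PRIMARY", primary_ip)
    # The health check decides when Route 53 stops answering with the primary.
    primary["HealthCheckId"] = primary_health_check_id
    return [primary, record("SECONDARY", secondary_ip)]


records = failover_record_sets(
    "app.example.com.", "198.51.100.5", "203.0.113.10",
    "hypothetical-health-check-id",
)
for r in records:
    print(r["Failover"], r["ResourceRecords"][0]["Value"])
```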

Multi-Site Pattern

Fully functional system runs in second Region simultaneously with on-premises systems or systems in different Region.

Key Characteristics

Runs in active-active configuration
Data replication method determined by chosen recovery point
Both sites support full production capacity
Can use DNS service supporting weighted routing (like Route 53)
Proportion of traffic goes to AWS infrastructure, remainder to on-site infrastructure
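
With Route 53 weighted routing, each record receives a share of traffic equal to its weight divided by the sum of all weights for that name, and weights range from 0 to 255. The sketch below computes those shares. The site labels and the function name are illustrative.

```python
def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of DNS responses each site receives under weighted routing."""
    for site, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {site} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {site: weight / total for site, weight in weights.items()}


# Normal operation: a quarter of traffic goes to AWS, the rest stays on-site.
shares = traffic_share({"aws": 64, "on-premises": 192})
print(shares)

# Disaster: shift all traffic to AWS by zeroing the on-site weight.
failover_shares = traffic_share({"aws": 255, "on-premises": 0})
print(failover_shares)
```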

Traffic Distribution

Adjust DNS weighting to send all traffic to cloud environment
Capacity of cloud deployment rapidly increased to handle full production load
Use Amazon EC2 Auto Scaling to automate scaling process
May need application logic to detect primary database failure and cut over to parallel running database services

Cost Considerations

Cost determined by volume of production traffic during normal operation
In recovery phase, pay only for what you use for duration DR environment needed at full scale
Can reduce cost by purchasing Amazon EC2 Reserved Instances for always-on AWS servers

Routing Policies with Route 53

Geolocation routing: Configure which Region request goes to based on origin location
Latency routing: AWS automatically sends requests to Region providing shortest round-trip time

Data governance strategy helps inform which routing policy to use. Geolocation provides deterministic distribution and can keep user data within specific Region.
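
The difference between the two policies can be shown with a toy simulation. This is not how Route 53 is implemented. It only contrasts a fixed country-to-Region mapping with a choice based on measured round-trip time. The Region names, country codes, latency figures, and function names are made up for the example.

```python
def route_by_geolocation(country: str, region_by_country: dict[str, str],
                         default_region: str) -> str:
    """Deterministic: the request origin alone decides the Region."""
    return region_by_country.get(country, default_region)


def route_by_latency(rtt_ms_by_region: dict[str, float]) -> str:
    """Pick the Region with the shortest round-trip time for this client."""
    return min(rtt_ms_by_region, key=rtt_ms_by_region.get)


# Hypothetical mapping that keeps German users' data in an EU Region.
mapping = {"DE": "eu-central-1", "US": "us-east-1"}
print(route_by_geolocation("DE", mapping, "us-east-1"))

# Hypothetical measurements: the same client may be closer to another Region.
print(route_by_latency({"eu-central-1": 48.0, "eu-west-1": 31.5, "us-east-1": 95.0}))
```

The geolocation result never changes for a given origin, which is why it suits data residency requirements. The latency result can change as network conditions change.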

Implementation Steps

Preparation Phase:

Similar to warm standby pattern
Configure backup deployment to scale in and out with full production load
Have servers running and ready to receive traffic
Consider licensing cost of complete duplicate system

In Case of Disaster:

Immediately fail over all production load to backup site
Potentially least downtime of all patterns
Higher costs because entire duplicate system must be created at secondary site

Summary of DR Patterns

Cost vs RTO Relationship

Each pattern offers different combination of benefits:

Backup and Restore: Lowest cost, longest RTO - systems restored more slowly
Pilot Light: Moderate cost, moderate RTO - tens of minutes recovery
Warm Standby: Higher cost, faster RTO - minutes recovery for business-critical services
Multi-Site: Highest cost, fastest RTO - near real-time recovery with automatic failover

With AWS, you can cost-effectively operate each DR strategy. These patterns are examples of possible approaches - variations and combinations are possible.

Practice Game Day Exercises

Best practice to consistently exercise DR solution to ensure it works as intended:

Key Activities

Ensure backups, snapshots, and AMIs are being created and can successfully restore data
Monitor your monitoring system
Establish RTO and RPO, work to improve them where possible
Test response procedures to ensure effectiveness
Ensure teams are familiar with implementation procedures
Set up regular Game Days to test workload and team responses to simulated events

Game Day exercises test scenarios in which critical systems go offline or even entire Regions fail.
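
One check a Game Day might automate is whether the most recent backup of each system is recent enough to meet the RPO. The sketch below assumes the last backup timestamps have already been collected (for example from a backup inventory). The system names, timestamps, and function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rpo_breaches(last_backup_at: dict[str, datetime], rpo: timedelta,
                 now: datetime) -> list[str]:
    """Return systems whose latest backup is older than the allowed RPO."""
    return sorted(
        name for name, taken_at in last_backup_at.items()
        if now - taken_at > rpo
    )


# Fixed clock so the example is reproducible.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
backups = {
    "orders-db": now - timedelta(hours=2),
    "file-share": now - timedelta(hours=30),
}
print(rpo_breaches(backups, rpo=timedelta(hours=24), now=now))
```

A fresh timestamp does not prove the backup restores correctly, so a restore test is still needed alongside a check like this.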

Disaster recovery patterns provide structured approaches to balancing recovery objectives with cost considerations. The choice of pattern depends on business requirements, acceptable downtime, data loss tolerance, and budget constraints.