AWS Disaster Recovery Planning

Multi-Region Approach

To properly scope disaster recovery planning, think about AWS usage holistically. Most organizations use a combination of services across five service categories:

Storage: Amazon S3
Compute: Amazon EC2
Database: Amazon RDS
Networking & Content Delivery: Amazon VPC
Deployment Orchestration: AWS CloudFormation (Management & Governance)

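If a disaster occurs, your recovery point objective (RPO, the maximum acceptable data loss measured in time) and recovery time objective (RTO, the maximum acceptable downtime) guide backup and restore plans across each service area and affect production deployment architecture.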
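As a rough illustration of how RPO and RTO constrain a backup plan, the sketch below checks a hypothetical plan against its targets. It assumes that worst-case data loss equals the interval between backups and that worst-case downtime equals detection time plus restore time. The class name, field names, and figures are illustrative and do not come from any AWS API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical description of a backup-and-restore plan."""
    backup_interval: timedelta  # time between successive backups
    detection_time: timedelta   # time to notice the outage and decide to recover
    restore_time: timedelta     # time to restore service from the latest backup


def meets_objectives(plan: RecoveryPlan, rpo: timedelta, rto: timedelta) -> dict:
    """Compare a plan's worst-case figures with the RPO and RTO targets."""
    # Assumption: a failure just before the next backup loses one full interval.
    worst_case_data_loss = plan.backup_interval
    # Assumption: downtime is detection plus restore, with no overlap.
    worst_case_downtime = plan.detection_time + plan.restore_time
    return {
        "worst_case_data_loss": worst_case_data_loss,
        "worst_case_downtime": worst_case_downtime,
        "meets_rpo": worst_case_data_loss <= rpo,
        "meets_rto": worst_case_downtime <= rto,
    }
```

With daily backups and a 4-hour RPO, for example, this check reports that the RPO is missed, which suggests moving to more frequent snapshots or continuous replication.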

Storage and Backup Building Blocks

AWS Cloud storage consists of combinations of:

Block storage: Amazon EBS
File system storage: Amazon EFS
Object storage: Amazon S3, Amazon S3 Glacier

AWS DataSync

Provides movement of large amounts of data online between on-premises storage and Amazon S3, Amazon EFS, or Amazon FSx
Supports scripted copy jobs and scheduled data transfers from on-premises NFS and SMB storage
Can optionally use AWS Direct Connect links

Amazon S3 Cross-Region Replication

For many organizations, bulk data stored on AWS is in Amazon S3. S3 buckets exist in specific Regions chosen during creation.

S3 Durability

Amazon S3 provides 11 nines (99.999999999%) durability for:

S3 Standard
S3 Standard-IA
S3 One Zone-IA
Amazon S3 Glacier

S3 Standard, S3 Standard-IA, and Amazon S3 Glacier automatically store objects across a minimum of three Availability Zones, each separated by miles. S3 One Zone-IA stores objects in a single Availability Zone.

Cross-Region Replication Configuration

For critical applications requiring higher data security:

Enable versioning on both source and destination buckets (required for replication)
Add replication configuration to source bucket
Minimum configuration must indicate destination bucket for object replication
Include IAM role granting Amazon S3 permissions to copy objects to destination bucket
Copied objects retain metadata
Replicated objects can use different storage class in destination bucket
Can assign different ownership to destination bucket objects

S3 Replication Time Control (S3 RTC): Replicates data across Regions in a predictable timeframe. It is designed to replicate 99.99% of new objects within 15 minutes and is backed by an SLA.
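The replication settings described above can be sketched as a configuration document. The helper below builds the dictionary that the boto3 S3 client method put_bucket_replication accepts as its ReplicationConfiguration argument. The function name, rule ID, and default storage class are illustrative assumptions, and the sketch assumes versioning is already enabled on both buckets.

```python
def build_replication_config(role_arn, dest_bucket_arn, storage_class="STANDARD_IA",
                             enable_rtc=True, dest_account_id=None):
    """Build a replication configuration for one source bucket (illustrative helper)."""
    destination = {
        "Bucket": dest_bucket_arn,      # minimum requirement: where replicas go
        "StorageClass": storage_class,  # replicas may use a different storage class
    }
    if enable_rtc:
        # S3 Replication Time Control uses a 15-minute window; enabling it
        # also requires replication metrics to be enabled.
        destination["ReplicationTime"] = {"Status": "Enabled", "Time": {"Minutes": 15}}
        destination["Metrics"] = {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
    if dest_account_id:
        # Hand ownership of the replicas to the destination account.
        destination["Account"] = dest_account_id
        destination["AccessControlTranslation"] = {"Owner": "Destination"}
    rule = {
        "ID": "dr-replicate-all",   # hypothetical rule name
        "Priority": 1,
        "Status": "Enabled",
        "Filter": {"Prefix": ""},   # empty prefix matches every object
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": destination,
    }
    return {"Role": role_arn, "Rules": [rule]}


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# config = build_replication_config(
#     "arn:aws:iam::111122223333:role/replication-role",  # placeholder ARN
#     "arn:aws:s3:::example-dr-bucket",                   # placeholder ARN
# )
# boto3.client("s3").put_bucket_replication(
#     Bucket="example-source-bucket", ReplicationConfiguration=config)
```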

EBS Volume Snapshots

Back up EBS volume data to Amazon S3 by taking point-in-time snapshots:

Incremental backups: Save only blocks that changed since most recent snapshot
Minimizes snapshot creation time and saves storage costs by not duplicating data
Each snapshot contains all information needed to restore data to new EBS volume
New volume based on snapshot begins as exact replica of original volume
Volume loads data lazily in background, so it is available for immediate use

Important

After a snapshot has finished copying to Amazon S3 (its status is completed), you can copy it to another Region or within the same Region.
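The snapshot-then-copy sequence can be sketched with the boto3 EC2 client calls create_snapshot, the snapshot_completed waiter, and copy_snapshot. The function below is a minimal sketch, not a hardened implementation. It takes the two clients as parameters (one per Region) so that it can be exercised without AWS access; the function name and default description are assumptions.

```python
def snapshot_and_copy(source_ec2, dest_ec2, volume_id, source_region,
                      description="DR snapshot"):
    """Snapshot a volume, wait for completion, then copy it to another Region.

    source_ec2 and dest_ec2 are expected to behave like boto3 EC2 clients
    created for the source and destination Regions respectively.
    """
    # 1. Take a point-in-time snapshot of the volume in the source Region.
    snapshot = source_ec2.create_snapshot(VolumeId=volume_id, Description=description)
    snapshot_id = snapshot["SnapshotId"]

    # 2. A snapshot can only be copied once it has reached the completed state.
    source_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 3. The copy is requested from the destination Region's client.
    copy = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"Copy of {snapshot_id} from {source_region}",
    )
    return snapshot_id, copy["SnapshotId"]


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# src = boto3.client("ec2", region_name="us-east-1")
# dst = boto3.client("ec2", region_name="us-west-2")
# snapshot_and_copy(src, dst, "vol-0123456789abcdef0", "us-east-1")  # placeholder ID
```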

Amazon Data Lifecycle Manager

Automates creation, retention, and deletion of EBS volume snapshots:

Protect valuable data by enforcing regular backup schedule
Retain backups as required by auditors or internal compliance
Reduce storage costs by deleting outdated backups
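A lifecycle policy of this kind can be sketched as the PolicyDetails document passed to the boto3 DLM client method create_lifecycle_policy. The helper below is an illustration only: the function name, schedule name, tag values, and defaults (daily snapshots at 03:00 UTC, keep the last 7) are assumptions rather than recommended settings.

```python
# Hour-based intervals that Data Lifecycle Manager schedules accept.
VALID_INTERVALS_HOURS = (1, 2, 3, 4, 6, 8, 12, 24)


def build_snapshot_policy(tag_key, tag_value, interval_hours=24,
                          start_time="03:00", retain_count=7):
    """Build PolicyDetails for an EBS snapshot lifecycle policy (illustrative helper)."""
    if interval_hours not in VALID_INTERVALS_HOURS:
        raise ValueError(f"interval_hours must be one of {VALID_INTERVALS_HOURS}")
    if retain_count < 1:
        raise ValueError("retain_count must be at least 1")
    return {
        "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
        "ResourceTypes": ["VOLUME"],
        # Only volumes carrying this tag are backed up by the policy.
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [{
            "Name": f"every-{interval_hours}h",  # hypothetical schedule name
            # Creation: enforce a regular backup schedule.
            "CreateRule": {
                "Interval": interval_hours,
                "IntervalUnit": "HOURS",
                "Times": [start_time],  # UTC, hh:mm
            },
            # Retention and deletion: keep the newest N snapshots, drop older ones.
            "RetainRule": {"Count": retain_count},
            "CopyTags": True,
        }],
    }


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("dlm").create_lifecycle_policy(
#     ExecutionRoleArn="arn:aws:iam::111122223333:role/dlm-role",  # placeholder ARN
#     Description="Daily DR snapshots",
#     State="ENABLED",
#     PolicyDetails=build_snapshot_policy("Backup", "daily"),
# )
```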

File System Replication

Replicating file storage ensures continued access to files. DataSync accelerates data movement between:

Two Amazon EFS or Amazon FSx for Windows File Server file systems
On-premises storage and AWS file storage

Datasets can be transferred over AWS Direct Connect or the internet.

FSx for Windows File Server Backups

Takes daily automatic backups stored in Amazon S3
Default 30-minute backup window during daily backup period
Default 7-day retention period for daily automatic backups
Can take additional backups at any point

Like Amazon S3 storage classes, Amazon EFS and FSx for Windows File Server replicate data across Availability Zones. For multi-Region recovery requirements, use DataSync to replicate to a second Region.

Recovering Compute Infrastructure

You can obtain and boot new server instances or containers in minutes. You can also arrange automatic recovery of an EC2 instance when a system status check of the underlying hardware fails:

Instance rebooted on new hardware if necessary
Retains instance ID, IP addresses, EBS volume attachments, and configuration details
For complete recovery, configure instance to automatically start services/applications during initialization
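One common way to arrange this recovery is a CloudWatch alarm on the StatusCheckFailed_System metric whose action is the EC2 recover action. The sketch below builds keyword arguments for the boto3 CloudWatch client method put_metric_alarm. The alarm name, period, and number of evaluation periods are illustrative assumptions.

```python
def build_recovery_alarm(instance_id, region, evaluation_periods=2):
    """Build put_metric_alarm arguments that recover an instance (illustrative helper)."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "AlarmDescription": "Recover instance when the system status check fails",
        "Namespace": "AWS/EC2",
        # Reports 1 when the system (underlying hardware) status check fails.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action that moves the instance to healthy hardware.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# alarm = build_recovery_alarm("i-0123456789abcdef0", "us-east-1")  # placeholder ID
# boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)
```

The alarm only restores the instance itself. Services still need to start on boot, as the last item in the list above points out.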

Amazon Machine Images (AMIs)

Preconfigured with operating systems
Some include application stacks
Can configure custom AMIs

Golden AMI: Preconfigured with all necessary applications and services to perform designated function.

EventBridge for Regional Failover

Amazon EventBridge global endpoints improve the resiliency of multi-Region architectures through:

Core Service Capabilities

Global endpoint: Managed Route 53 DNS endpoint routing events to event buses in either Region, depending on primary Region service health
IngestionToInvocationStartLatency metric: Measures time to invoke first target after event ingestion, indicating EventBridge service health

Sustained latency above 30 seconds might indicate a service disruption in the primary Region.
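To make the idea of an extended period of high latency concrete, the sketch below evaluates a series of IngestionToInvocationStartLatency datapoints. It assumes the datapoints are in milliseconds and evenly spaced (for example, one per minute), and it treats five consecutive datapoints above 30 seconds as unhealthy. The consecutive-count value and the function name are assumptions for illustration, not EventBridge defaults.

```python
from typing import Sequence


def is_primary_unhealthy(latencies_ms: Sequence[float],
                         threshold_ms: float = 30_000,
                         consecutive: int = 5) -> bool:
    """Return True if latency stays above the threshold for `consecutive` datapoints."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    run = 0  # length of the current streak of high-latency datapoints
    for value in latencies_ms:
        if value > threshold_ms:
            run += 1
            if run >= consecutive:
                return True
        else:
            run = 0  # a healthy datapoint breaks the streak
    return False
```

A single spike does not trip the check; only a sustained streak does, which mirrors how a health check for failover would typically avoid reacting to transient blips.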

Designing for Resiliency and Recovery

Key AWS networking services for disaster recovery:

Route 53

Provides DNS-based load balancing and basic failover between endpoints or S3 websites
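Basic DNS failover can be sketched as a pair of failover records, one PRIMARY and one SECONDARY, where the primary is tied to a health check. The helper below builds the ChangeBatch accepted by the boto3 Route 53 client method change_resource_record_sets. The function name, set identifiers, TTL, and addresses are illustrative assumptions.

```python
def build_failover_change_batch(record_name, primary_ip, secondary_ip,
                                primary_health_check_id, ttl=60):
    """Build a ChangeBatch with PRIMARY and SECONDARY failover A records."""

    def change(role, ip, health_check_id=None):
        record = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-site",  # hypothetical identifier
            "Failover": role,                         # "PRIMARY" or "SECONDARY"
            "TTL": ttl,  # a short TTL lets clients pick up a failover quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id is not None:
            # Route 53 answers with the secondary when this check is unhealthy.
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    return {
        "Comment": "DR failover records",
        "Changes": [
            change("PRIMARY", primary_ip, primary_health_check_id),
            change("SECONDARY", secondary_ip),
        ],
    }


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# batch = build_failover_change_batch(
#     "app.example.com.", "192.0.2.10", "198.51.100.10",
#     "11111111-2222-3333-4444-555555555555")  # placeholder health check ID
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z0000000EXAMPLE", ChangeBatch=batch)  # placeholder zone ID
```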

Elastic Load Balancing

Distributes incoming traffic across multiple targets and Availability Zones and uses health checks to route around failed targets, which simplifies disaster recovery implementation

AWS VPN

Provides secure access to on-premises network resources from Amazon VPC through VPN connection

AWS Direct Connect

Provides dedicated network connection for fast, consistent data transfer between on-premises and AWS

When recovering from a disaster, you’ll likely need to modify network settings to fail the system over to another site.

Supporting Database Recovery

Amazon RDS Features

Save snapshots in separate Region
Use read replicas and Multi-AZ deployments
Retain automated backups
Share manual snapshots with up to 20 other AWS accounts
Combining read replicas with Multi-AZ deployments builds resilient DR strategy

Read replicas: Create one or more read-only copies of a database instance in the same Region or a different Region. Updates are copied to read replicas asynchronously, so a replica can lag slightly behind its source. A replica can be promoted to a standalone database instance when needed.
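The replica lifecycle can be sketched with the boto3 RDS client calls create_db_instance_read_replica and promote_read_replica. The functions below take the destination-Region client as a parameter so they can be exercised without AWS access. Function names and identifiers are assumptions, and the sketch ignores extras that encrypted sources need (such as a KMS key in the destination Region).

```python
def create_cross_region_replica(dest_rds, replica_id, source_db_arn):
    """Create a read replica in the destination Region from a source instance ARN.

    dest_rds is expected to behave like a boto3 RDS client for the
    destination Region; cross-Region sources are identified by ARN.
    """
    response = dest_rds.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_db_arn,
    )
    return response["DBInstance"]["DBInstanceIdentifier"]


def promote_replica(dest_rds, replica_id):
    """Promote a read replica to a standalone, writable instance during failover.

    Promotion stops replication, so writes made at the source afterwards
    are not copied to the promoted instance.
    """
    response = dest_rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    return response["DBInstance"]["DBInstanceStatus"]


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# rds_west = boto3.client("rds", region_name="us-west-2")
# create_cross_region_replica(
#     rds_west, "orders-replica",
#     "arn:aws:rds:us-east-1:111122223333:db:orders")  # placeholder ARN
# promote_replica(rds_west, "orders-replica")  # only when failing over
```

Because replication is asynchronous, any lag at the moment of promotion translates directly into lost updates, which is worth comparing with the RPO.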

DynamoDB Features

Back up entire tables on demand
Use point-in-time recovery to restore tables to any point within the preceding 35 days
Create backups from the console or with a single API call
Use global tables to build multi-Region, multi-active database

DynamoDB global tables: Automatically replicate DynamoDB tables across choice of Regions, keeping applications highly available even during Region-level disasters.
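Both features can be switched on with existing boto3 DynamoDB client calls: update_continuous_backups for point-in-time recovery, and update_table with ReplicaUpdates to add a replica Region to a global table. The functions below are a minimal sketch that assumes the current global tables version (2019.11.21); function names and the table name are illustrative.

```python
def enable_point_in_time_recovery(dynamodb, table_name):
    """Turn on continuous backups so the table can be restored to a past moment."""
    return dynamodb.update_continuous_backups(
        TableName=table_name,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )


def add_replica_region(dynamodb, table_name, region):
    """Add a replica of the table in another Region, making it a global table.

    dynamodb is expected to behave like a boto3 DynamoDB client in a
    Region where the table already exists.
    """
    return dynamodb.update_table(
        TableName=table_name,
        ReplicaUpdates=[{"Create": {"RegionName": region}}],
    )


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# ddb = boto3.client("dynamodb", region_name="us-east-1")
# enable_point_in_time_recovery(ddb, "orders")  # placeholder table name
# add_replica_region(ddb, "orders", "us-west-2")
```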

Replicating and Redeploying Environments

AWS CloudFormation

Use templates to quickly deploy collections of resources as needed
Duplicate production environments in new Region or VPC in minutes
Model and deploy entire infrastructure in text file as single source of truth
Provisions resources in repeatable manner for building and rebuilding infrastructure and applications

Important

CloudFormation addresses resource configuration, not data associated with resources.
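Duplicating an environment in another Region can be sketched as deploying one template through a CloudFormation client per Region. The example below builds a deliberately tiny template (a single versioned S3 bucket) and calls the boto3 method create_stack in each Region. The function names, logical ID, and the client_factory parameter are illustrative assumptions; a real environment template would be far larger.

```python
import json


def build_template(bucket_logical_id="BackupBucket"):
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal DR example: one versioned S3 bucket",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
            }
        },
    }


def deploy_to_regions(client_factory, stack_name, template, regions):
    """Create the same stack in each Region and return the stack IDs by Region.

    client_factory(region) must return an object that behaves like a boto3
    CloudFormation client for that Region.
    """
    body = json.dumps(template)  # the same text file serves every Region
    stack_ids = {}
    for region in regions:
        response = client_factory(region).create_stack(
            StackName=stack_name, TemplateBody=body)
        stack_ids[region] = response["StackId"]
    return stack_ids


# Possible usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# factory = lambda region: boto3.client("cloudformation", region_name=region)
# deploy_to_regions(factory, "dr-demo", build_template(), ["us-east-1", "us-west-2"])
```

As the note above says, this recreates resource configuration only. The bucket in the second Region starts empty unless data is replicated separately.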

AWS OpsWorks

Application management service providing configuration management and automation
Deploy and operate applications of all types and sizes
Define environment as series of layers, configure each layer as application tier
Automatic host replacement for instance failures
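Use in DR preparation phase to template environment, combine with CloudFormation in DR recovery phase
AWS ended support for OpsWorks Stacks in May 2024, so confirm availability before relying on it in new DR plans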

AWS disaster recovery planning involves leveraging multiple AWS services across storage, compute, database, networking, and automation categories to create comprehensive, multi-Region disaster recovery strategies that meet specific RPO and RTO requirements.