Pablo Rodriguez

AWS Disaster Recovery Planning

To properly scope disaster recovery planning, think about AWS usage holistically. Most organizations use a combination of services across five service categories:

  • Storage: Amazon S3
  • Compute: Amazon EC2
  • Database: Amazon RDS
  • Networking & Content Delivery: Amazon VPC
  • Deployment Orchestration: AWS CloudFormation (Management & Governance)

If a disaster occurs, the recovery point objective (RPO) and recovery time objective (RTO) guide backup and restore plans across each service area and affect the production deployment architecture.
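As a rough illustration of how RPO and RTO constrain a plan, the sketch below compares a backup interval and an estimated restore time with the two objectives. The function name and the sample figures are hypothetical, chosen only for illustration.

```python
from datetime import timedelta


def meets_objectives(backup_interval: timedelta,
                     estimated_restore: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare a backup plan with its recovery objectives.

    Worst-case data loss equals the time between backups, so the
    backup interval must not exceed the RPO. The time needed to
    restore service must not exceed the RTO.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": estimated_restore <= rto,
    }


# Hypothetical plan: hourly backups, 3-hour restore, RPO 4 h, RTO 2 h.
result = meets_objectives(
    backup_interval=timedelta(hours=1),
    estimated_restore=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=2),
)
print(result)
```

In this made-up case the backup schedule satisfies the RPO, but the restore procedure is too slow for the RTO, which would push the design toward a warmer standby.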

AWS Cloud storage consists of combinations of:

  • Block storage: Amazon EBS
  • File system storage: Amazon EFS
  • Object storage: Amazon S3, Amazon S3 Glacier

AWS DataSync:

  • Moves large amounts of data online between on-premises storage and Amazon S3, Amazon EFS, or Amazon FSx
  • Supports scripted copy jobs and scheduled data transfers from on-premises NFS and SMB storage
  • Can optionally use AWS Direct Connect links

For many organizations, the bulk of the data stored on AWS is in Amazon S3. Each S3 bucket exists in a specific Region that is chosen when the bucket is created.

Amazon S3 is designed for 11 nines (99.999999999%) of durability for:

  • S3 Standard
  • S3 Standard-IA
  • S3 One Zone-IA
  • Amazon S3 Glacier

S3 Standard, S3 Standard-IA, and Amazon S3 Glacier automatically store objects across a minimum of three Availability Zones, each separated by miles.

For critical applications requiring higher data security, replicate objects to a second bucket:

  • Add a replication configuration to the source bucket
  • At minimum, the configuration must specify the destination bucket for object replication
  • Include an IAM role that grants Amazon S3 permission to copy objects to the destination bucket
  • Copied objects retain their metadata
  • The destination bucket can use a different storage class
  • Replicated objects in the destination bucket can be assigned a different owner
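The replication settings listed above could be assembled roughly as follows. This is a minimal sketch that only builds the configuration dictionary in the shape accepted by the boto3 `put_bucket_replication` call; it makes no AWS request. The helper name, rule ID, ARNs, and account ID are hypothetical placeholders.

```python
def build_replication_config(role_arn, dest_bucket_arn,
                             storage_class=None, dest_account_id=None):
    """Build an S3 replication configuration (no AWS call is made)."""
    destination = {"Bucket": dest_bucket_arn}
    if storage_class:
        # The destination may use a different storage class.
        destination["StorageClass"] = storage_class
    if dest_account_id:
        # Hand ownership of the replicas to the destination account.
        destination["Account"] = dest_account_id
        destination["AccessControlTranslation"] = {"Owner": "Destination"}
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [{
            "ID": "dr-replication",        # hypothetical rule name
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},      # empty prefix: all objects
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": destination,
        }],
    }


# Placeholder identifiers, for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::111111111111:role/example-replication-role",
    dest_bucket_arn="arn:aws:s3:::example-dr-bucket",
    storage_class="STANDARD_IA",
    dest_account_id="222222222222",
)
print(config["Rules"][0]["Destination"])
```

With boto3 installed and credentials configured, a dictionary like this would be passed as `ReplicationConfiguration` to `put_bucket_replication` on the source bucket. Versioning must be enabled on both buckets before replication can be configured.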

S3 Replication Time Control (S3 RTC): Replicates data across Regions in a predictable timeframe, replicating 99.99% of new objects within 15 minutes, backed by an SLA.

Back up EBS volume data to Amazon S3 by taking point-in-time snapshots:

  • Snapshots are incremental backups: they save only the blocks that changed since the most recent snapshot
  • This minimizes snapshot creation time and saves storage costs by not duplicating data
  • Each snapshot contains all the information needed to restore data to a new EBS volume
  • A new volume created from a snapshot begins as an exact replica of the original volume
  • The new volume loads its data in the background, so you can begin using it immediately
Important

After a snapshot has finished copying to Amazon S3, you can copy it to another Region or within the same Region.
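The idea behind incremental snapshots can be shown with a toy model. The sketch below is not how Amazon EBS is implemented internally; it only illustrates, under simplified assumptions (a fixed-size list of blocks and a parent pointer per snapshot), how each snapshot can store just the changed blocks and still restore the full volume.

```python
def restore(snapshot):
    """Rebuild the full block map by walking the snapshot chain."""
    chain = []
    while snapshot is not None:
        chain.append(snapshot)
        snapshot = snapshot["parent"]
    blocks = {}
    for snap in reversed(chain):  # oldest first, so newer blocks win
        blocks.update(snap["blocks"])
    return blocks


def take_snapshot(volume, parent=None):
    """Record only the blocks that differ from the parent snapshot."""
    baseline = restore(parent) if parent else {}
    changed = {i: data for i, data in enumerate(volume)
               if baseline.get(i) != data}
    return {"blocks": changed, "parent": parent}


volume = ["a", "b", "c", "d"]
snap1 = take_snapshot(volume)                 # first snapshot: all blocks
volume[2] = "C"                               # one block changes
snap2 = take_snapshot(volume, parent=snap1)   # stores only that block

restored = [restore(snap2)[i] for i in range(len(volume))]
print(len(snap1["blocks"]), len(snap2["blocks"]), restored)
```

The second snapshot holds a single block, yet restoring from it yields the complete current volume because unchanged blocks are resolved through the earlier snapshot.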

Amazon Data Lifecycle Manager automates the creation, retention, and deletion of EBS volume snapshots:

  • Protect valuable data by enforcing regular backup schedule
  • Retain backups as required by auditors or internal compliance
  • Reduce storage costs by deleting outdated backups
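The retention step above can be sketched as a simple count-based rule: keep the newest N snapshots and mark the rest for deletion. This is an illustrative model only, not the service's own logic; the snapshot IDs, dates, and function name are hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, retain_count):
    """Return the IDs of snapshots beyond the newest `retain_count`."""
    ordered = sorted(snapshots, key=lambda s: s["start_time"], reverse=True)
    return [s["id"] for s in ordered[retain_count:]]


# Five hypothetical daily snapshots; snap-0 is the newest.
now = datetime(2024, 1, 10)
snapshots = [
    {"id": f"snap-{age}", "start_time": now - timedelta(days=age)}
    for age in range(5)
]

expired = snapshots_to_delete(snapshots, retain_count=3)
print(expired)
```

A schedule that retains three daily snapshots would therefore drop the two oldest ones, which is how regular backups and bounded storage costs can coexist.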

Replicating file storage ensures continued access to files. DataSync accelerates data movement between:

  • Two Amazon EFS or Amazon FSx for Windows File Server file systems
  • On-premises storage and AWS file storage

DataSync can transfer datasets over AWS Direct Connect or the internet.

Amazon FSx for Windows File Server backups:

  • Takes daily automatic backups, stored in Amazon S3
  • Uses a default 30-minute backup window during the daily backup period
  • Keeps daily automatic backups for a default retention period of 7 days
  • Lets you take additional backups at any point

Like Amazon S3 storage classes, Amazon EFS and FSx for Windows File Server replicate data across Availability Zones. For multi-Region recovery requirements, use DataSync to replicate to a second Region.

You can obtain and boot new server instances or containers in minutes. You can also arrange automatic recovery of an EC2 instance when the system status check of the underlying hardware fails:

  • The instance is rebooted, on new hardware if necessary
  • It retains its instance ID, IP addresses, EBS volume attachments, and configuration details
  • For complete recovery, configure the instance to automatically start services and applications during initialization

Amazon Machine Images (AMIs):

  • Are preconfigured with operating systems
  • Sometimes include application stacks
  • Can be customized: you can configure your own AMIs

Golden AMI: An AMI preconfigured with all the applications and services needed to perform its designated function.
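The automatic EC2 instance recovery described earlier in this section is typically driven by a CloudWatch alarm on the `StatusCheckFailed_System` metric with the EC2 recover action. The sketch below only builds the keyword arguments for the boto3 `put_metric_alarm` call and makes no AWS request. The alarm name, instance ID, Region, and the period and evaluation settings are assumptions for illustration.

```python
def recovery_alarm_params(instance_id, region, evaluation_periods=2):
    """Build put_metric_alarm arguments for EC2 auto recovery."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                       # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 1.0,                   # 1 means the check failed
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action that recovers the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Placeholder instance ID and Region.
params = recovery_alarm_params("i-0123456789abcdef0", "us-east-1")
print(params["AlarmActions"])
```

With boto3 available, these arguments would be passed as `cloudwatch.put_metric_alarm(**params)`. Requiring two consecutive failed periods is a design choice here that avoids reacting to a single transient datapoint.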

Amazon EventBridge global endpoints address multi-Region resiliency through:

  • Global endpoint: A managed Route 53 DNS endpoint that routes events to the event bus in either Region, depending on the health of the service in the primary Region
  • IngestionToInvocationStartLatency metric: Measures the time taken to invoke the first target after an event is ingested, which indicates EventBridge service health

Extended periods of latency above 30 seconds might indicate a service disruption.
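One way to read "extended periods of high latency" is as a run of consecutive datapoints above the 30-second mark. The sketch below applies that interpretation to a list of latency samples. The 30-second threshold comes from the text above; the number of consecutive datapoints and the function name are assumptions.

```python
def endpoint_unhealthy(latency_seconds, threshold=30.0, sustained_points=5):
    """Return True when the most recent datapoints all exceed the threshold.

    `sustained_points` is an assumed definition of "extended period":
    that many consecutive samples must be above the threshold.
    """
    if len(latency_seconds) < sustained_points:
        return False
    recent = latency_seconds[-sustained_points:]
    return all(value > threshold for value in recent)


# Hypothetical per-minute latency samples, in seconds.
degraded = [1, 2, 35, 40, 45, 50, 60]
recovered = [35, 40, 45, 50, 2]

print(endpoint_unhealthy(degraded), endpoint_unhealthy(recovered))
```

Requiring a sustained run rather than a single spike keeps a brief latency blip from triggering a failover to the secondary Region.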

Key AWS networking services for disaster recovery:

Route 53

Provides DNS-based load balancing and basic failover between endpoints or S3 websites

Elastic Load Balancing

Provides traffic distribution and makes disaster recovery implementation straightforward

Amazon VPN

Provides secure access to on-premises network resources from Amazon VPC through VPN connection

AWS Direct Connect

Provides dedicated network connection for fast, consistent data transfer between on-premises and AWS

When recovering from a disaster, you'll likely need to modify network settings to fail the system over to another site.
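For the DNS part of a failover, Route 53 failover routing uses a primary and a secondary record with the same name. The sketch below builds a change batch in the shape accepted by the boto3 `change_resource_record_sets` call, without contacting AWS. The domain name, IP addresses (taken from documentation ranges), health check ID, and helper name are hypothetical.

```python
def failover_change_batch(name, primary_ip, secondary_ip,
                          health_check_id, ttl=60):
    """Build a Route 53 change batch with PRIMARY and SECONDARY records."""
    def record(role, ip, extra=None):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-site",
            "Failover": role,               # "PRIMARY" or "SECONDARY"
            "TTL": ttl,                     # short TTL speeds up failover
            "ResourceRecords": [{"Value": ip}],
        }
        rrset.update(extra or {})
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "DR failover records (illustrative)",
        "Changes": [
            # Traffic goes here while the health check passes.
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            # Used when the primary is reported unhealthy.
            record("SECONDARY", secondary_ip),
        ],
    }


batch = failover_change_batch(
    name="app.example.com.",
    primary_ip="203.0.113.10",
    secondary_ip="198.51.100.10",
    health_check_id="example-health-check-id",
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

Defining both records ahead of time means the switch to the recovery site happens through health checks rather than through manual DNS edits during an incident.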

Amazon RDS disaster recovery options:

  • Save snapshots in a separate Region
  • Use read replicas and Multi-AZ deployments
  • Retain automated backups
  • Share manual snapshots with up to 20 other AWS accounts
  • Combine read replicas with Multi-AZ deployments to build a resilient DR strategy

Read replicas: Create one or more read-only copies of a database instance in the same Region or a different Region. Updates are copied asynchronously to the read replicas. A read replica can be promoted to a standalone database instance when needed.

Amazon DynamoDB disaster recovery options:

  • Back up entire tables
  • Use point-in-time recovery to restore tables
  • Create backups
  • Use global tables to build a multi-Region, multi-active database

DynamoDB global tables: Automatically replicate DynamoDB tables across your choice of Regions, keeping applications highly available even during Region-level disasters.

AWS CloudFormation:

  • Use templates to quickly deploy collections of resources as needed
  • Duplicate production environments in a new Region or VPC in minutes
  • Model and deploy your entire infrastructure in a text file that serves as a single source of truth
  • Provision resources in a repeatable manner for building and rebuilding infrastructure and applications
Important

CloudFormation addresses resource configuration, not data associated with resources.
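To make the template idea concrete, here is a minimal sketch that generates a small CloudFormation template as JSON. It declares a single versioned S3 bucket; the logical ID, description, and function name are hypothetical, and a real DR template would describe far more of the environment.

```python
import json


def bucket_template(logical_id="BackupBucket"):
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative DR template with one S3 bucket",
        "Resources": {
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Versioning keeps earlier object versions recoverable.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            },
        },
    }


# The same text can be deployed unchanged in any Region.
template_body = json.dumps(bucket_template(), indent=2)
print(template_body)
```

With boto3 available, the resulting string could be supplied as `TemplateBody` to `create_stack` using a client for the recovery Region. The template recreates the bucket's configuration only; the objects themselves would still have to come from replication or backups.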

AWS OpsWorks:

  • Is an application management service that provides configuration management and automation
  • Lets you deploy and operate applications of all types and sizes
  • Lets you define an environment as a series of layers and configure each layer as an application tier
  • Provides automatic host replacement when instances fail
  • Can be used in the DR preparation phase to template the environment, then combined with CloudFormation in the DR recovery phase

AWS disaster recovery planning combines multiple AWS services across the storage, compute, database, networking, and automation categories into comprehensive, multi-Region disaster recovery strategies that meet specific RPO and RTO requirements.