Pablo Rodriguez

Applying AWS Well-Architected Framework Principles to Disaster Planning

The AWS Well-Architected Framework has six pillars, each including best practices and questions to consider when architecting cloud solutions. This section highlights best practices from the pillars most relevant to disaster planning:

  • Reliability
  • Operational Excellence
  • Security

Building resilient workloads helps prepare for any event that prevents a workload from fulfilling business objectives in its primary location. Failure management is essential for implementing resiliency, with disaster recovery planning as a key component.

Foundation

Having backups and redundant workload components in place is the start of your DR strategy.

Define recovery objectives for downtime and data loss:

  • Assign every workload an RTO and RPO based on business impact
  • Understand impact of downtime and lost data on business
  • Impact generally grows with greater downtime or data loss, but growth shape differs by workload type

Business impact manifests in multiple forms:

  • Monetary cost: Lost revenue
  • Customer trust: Impact to reputation
  • Operational issues: Missing payroll, decreased productivity
  • Regulatory risk: Compliance violations

For example, you might tolerate downtime of up to an hour with little impact, but beyond that point the impact rises quickly.
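The one-hour example can be sketched as a simple piecewise impact model. This is a minimal illustration, not a costing method from the framework: the function name `downtime_impact` and the per-minute rates are hypothetical values chosen only to show impact staying low inside a tolerance window and climbing quickly after it.

```python
def downtime_impact(minutes: float,
                    tolerance_minutes: float = 60.0,
                    low_rate: float = 10.0,
                    high_rate: float = 500.0) -> float:
    """Estimate the cost of an outage lasting `minutes`.

    All rates are hypothetical cost units per minute. Replace them with
    figures from your own business impact analysis.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Time spent inside the tolerance window is charged at the low rate.
    within = min(minutes, tolerance_minutes)
    # Time beyond the window is charged at the much higher rate.
    beyond = max(minutes - tolerance_minutes, 0.0)
    return within * low_rate + beyond * high_rate


print(downtime_impact(30))  # 300.0
print(downtime_impact(90))  # 15600.0
```

Other workloads may follow a different curve (for example, a step at a regulatory deadline), which is why the growth shape should be assessed per workload.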

RTO and RPO are the primary considerations when selecting and implementing a disaster recovery strategy, along with:

  • Cost constraints
  • Workload dependencies
  • Operational requirements

Best Practice: Use Defined Recovery Strategies

Use defined recovery strategies to meet recovery objectives:

  • Define a DR strategy that meets the workload’s recovery objectives
  • Choose a strategy such as backup and restore, pilot light, warm standby, or multi-site
  • Choosing a DR strategy is a trade-off between reducing downtime and data loss (RTO and RPO) and the cost and complexity of implementation
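The trade-off can be sketched as picking the cheapest strategy whose recovery characteristics still meet the workload's objectives. The names `Strategy` and `choose_strategy` are hypothetical, and the RTO/RPO figures and cost ranks in the table are illustrative assumptions, not AWS guarantees; what each strategy can actually achieve depends on the implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_minutes: float  # assumed value, for illustration only
    typical_rpo_minutes: float  # assumed value, for illustration only
    relative_cost: int          # 1 = cheapest, 4 = most expensive


# Hypothetical figures, ordered from lowest to highest cost and complexity.
STRATEGIES = [
    Strategy("backup and restore", 24 * 60, 24 * 60, 1),
    Strategy("pilot light", 60, 15, 2),
    Strategy("warm standby", 15, 5, 3),
    Strategy("multi-site", 1, 1, 4),
]


def choose_strategy(rto_minutes: float, rpo_minutes: float) -> Optional[Strategy]:
    """Return the cheapest strategy that meets both objectives, if any."""
    candidates = [
        s for s in STRATEGIES
        if s.typical_rto_minutes <= rto_minutes
        and s.typical_rpo_minutes <= rpo_minutes
    ]
    if not candidates:
        return None  # objectives are tighter than any strategy in the table
    return min(candidates, key=lambda s: s.relative_cost)


print(choose_strategy(rto_minutes=30, rpo_minutes=10).name)  # warm standby
```

In practice, cost constraints, workload dependencies, and operational requirements would also feed into this decision.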

Test the disaster recovery implementation to validate it:

  • Regularly test failover to recovery site to verify proper operation
  • Verify that RTO and RPO are met
  • The only error recovery that works is the path you test frequently
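Checking whether a failover test met its objectives comes down to two subtractions: downtime is the time from outage to restored service, and data loss is the gap between the last recoverable data point and the outage. The sketch below assumes you record three timestamps per drill; `DrillResult` and `evaluate_drill` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    outage_start: datetime           # when the simulated failure began
    service_restored: datetime       # when the recovery site served traffic
    last_recoverable_data: datetime  # newest data present after recovery


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured downtime and data loss against RTO and RPO."""
    downtime = result.service_restored - result.outage_start
    data_loss = result.outage_start - result.last_recoverable_data
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


drill = DrillResult(
    outage_start=datetime(2024, 1, 15, 12, 0),
    service_restored=datetime(2024, 1, 15, 12, 45),
    last_recoverable_data=datetime(2024, 1, 15, 11, 50),
)
report = evaluate_drill(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=5))
print(report["rto_met"], report["rpo_met"])  # True False
```

Here the drill recovers within the one-hour RTO but loses ten minutes of data against a five-minute RPO, so the test fails on RPO.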

Operational Excellence Pillar: Event Management

The Operational Excellence pillar covers the ability to support development and run workloads effectively, gain insight into operations, and continuously improve supporting processes and procedures to deliver business value.

Best Practice: Customer Communication Plan

Define a customer communication plan for outages:

  • Define and test communication plan for system outages
  • Keep customers and stakeholders informed during outages
  • Communicate directly with users when services are impacted and when services return to normal

For example, when a workload is impaired, Any Company Retail:

  • Sends email notification to users describing impaired business functionality
  • Provides realistic estimate of when service will be restored
  • Maintains status page showing real-time information about workload health
  • Tests communication plan in development environment twice per year to validate effectiveness

Your organization’s business continuity plan (BCP) includes the disaster recovery plan and the logistics of disaster recovery. Use AWS services and resources to design for resiliency and recovery.

Security Pillar: Identity and Access Management

Account for identity and access management when preparing for disasters.

Establish emergency access process:

  • Create a process that provides emergency access to workloads in the unlikely event of an issue with the centralized identity provider
  • Design processes for the different failure modes that might result in an emergency event

Under normal circumstances, workforce users federate to the cloud through a centralized identity provider to manage workloads. However, if the centralized identity provider fails or the federation configuration in the cloud is modified, workforce users might not be able to federate into the cloud.

An emergency access process gives authorized administrators access to cloud resources through alternate means so that they can:

  • Fix issues with the federation configuration
  • Fix issues with workloads

This reduces the time users take to respond to and resolve emergency events, which results in less downtime and higher availability of the services provided to customers.
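One possible form of alternate access is a dedicated emergency role that a small set of IAM users can assume without going through federation. The sketch below builds an IAM trust policy document for such a role. It is an assumption for illustration, not a process prescribed by the source: the function name, the account ID `111122223333`, and the user names are hypothetical placeholders, and the MFA condition is one common safeguard among several.

```python
import json


def build_emergency_trust_policy(account_id: str, user_names: list[str]) -> dict:
    """Build a trust policy letting named IAM users assume a role with MFA."""
    principals = [
        f"arn:aws:iam::{account_id}:user/{name}" for name in user_names
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "sts:AssumeRole",
                # Only allow the role to be assumed when MFA was used.
                "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
            }
        ],
    }


# Hypothetical account ID and user names, for illustration only.
policy = build_emergency_trust_policy(
    "111122223333", ["break-glass-admin-1", "break-glass-admin-2"]
)
print(json.dumps(policy, indent=2))
```

Whatever mechanism is chosen, its credentials should be tightly controlled and its use monitored, and the process should be exercised along with the rest of the disaster recovery plan.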

Well-documented and well-tested emergency access processes are essential for effective disaster recovery.

Regularly exercise the disaster recovery solution to confirm that it works as intended:

  • Test scenarios when critical systems go offline
  • Test entire Region failures
  • Test response procedures for effectiveness
  • Ensure teams are familiar with implementation procedures
  • Conduct continuous testing on all components related to disaster recovery
  • Verify backups, snapshots, and AMIs can successfully restore data
  • Monitor the monitoring systems themselves so that a failure in alerting does not go unnoticed
  • Test communication plans and emergency access processes
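Part of continuous testing can be automated. As a minimal sketch, the check below flags resources whose newest backup is older than the RPO allows. The function name `stale_backups` and the resource names are hypothetical, and the timestamps are assumed to come from your own backup inventory.

```python
from datetime import datetime, timedelta, timezone


def stale_backups(latest_backup_times: dict[str, datetime],
                  rpo: timedelta,
                  now: datetime) -> list[str]:
    """Return names of resources whose newest backup is older than the RPO."""
    return sorted(
        name
        for name, backed_up_at in latest_backup_times.items()
        if now - backed_up_at > rpo
    )


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
inventory = {
    # Hypothetical resources and backup timestamps.
    "orders-db": now - timedelta(minutes=10),
    "catalog-db": now - timedelta(hours=3),
    "user-uploads": now - timedelta(minutes=55),
}
print(stale_backups(inventory, rpo=timedelta(hours=1), now=now))  # ['catalog-db']
```

A recency check like this only shows that backups exist. Confirming that they can actually restore data still requires performing a restore.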

This module’s topics support Well-Architected best practices:

Elements Related to Avoiding and Planning for Disaster

  • Understanding failure types and scales
  • Implementing fault tolerance, backup, and disaster recovery strategies
  • Considering factors influencing disaster planning strategies
  • Defining recovery objectives based on business requirements
  • Calculating acceptable data loss and downtime
  • Aligning technical solutions with business needs
  • Backup and restore for lower-priority use cases
  • Pilot light for core services requiring moderate recovery times
  • Warm standby for business-critical services
  • Multi-site for near real-time recovery requirements
  • Storage services: S3 Cross-Region Replication, EBS snapshots
  • Compute services: AMIs, Auto Scaling
  • Database services: RDS read replicas, DynamoDB global tables
  • Networking services: Route 53, ELB
  • Automation services: CloudFormation, OpsWorks

Applying AWS Well-Architected Framework principles to disaster planning helps keep recovery strategies aligned with business requirements while maintaining operational excellence, security, and reliability standards. The framework provides a structured approach to evaluating and improving disaster recovery capabilities.