Well Architected Framework Disaster Planning
Applying AWS Well-Architected Framework Principles to Disaster Planning
Well-Architected Framework Overview
The AWS Well-Architected Framework has six pillars, each including best practices and questions to consider when architecting cloud solutions. This section highlights best practices from the pillars most relevant to disaster planning:
- Reliability
- Operational Excellence
- Security
Reliability Pillar: Failure Management
Plan for Disaster Recovery
Building resilient workloads helps prepare for any event that prevents a workload from fulfilling business objectives in its primary location. Failure management is essential for implementing resiliency, with disaster recovery planning as a key component.
Foundation: Having backups and redundant workload components in place is the start of your DR strategy.
Best Practice: Define Recovery Objectives
Define recovery objectives for downtime and data loss:
- Assign every workload a recovery time objective (RTO) and recovery point objective (RPO) based on business impact
- Understand the impact of downtime and lost data on the business
- Impact generally grows with greater downtime or data loss, but the shape of that growth differs by workload type
Business impact manifests in multiple forms:
- Monetary cost: Lost revenue
- Customer trust: Impact to reputation
- Operational issues: Missing payroll, decreased productivity
- Regulatory risk: Compliance violations
Tolerance Example
You might tolerate downtime for up to an hour with little impact, but after that, impact quickly rises.
RTO and RPO are the primary considerations when selecting a disaster recovery strategy, along with:
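The tolerance shape described above can be sketched as a piecewise function. This is a minimal illustration, assuming a hypothetical workload whose impact grows slowly up to the tolerance threshold and steeply afterwards. The one-hour threshold comes from the example; the function name and both cost rates are invented placeholders, not measured values.

```python
def downtime_impact(minutes_down: float,
                    tolerance_minutes: float = 60.0,
                    base_cost_per_minute: float = 1.0,
                    escalated_cost_per_minute: float = 50.0) -> float:
    """Estimate business impact (arbitrary cost units) of an outage.

    The rates are hypothetical placeholders; replace them with figures
    from your own business impact analysis.
    """
    if minutes_down < 0:
        raise ValueError("minutes_down must be non-negative")
    # Minutes inside the tolerated window accrue impact at the low rate.
    within = min(minutes_down, tolerance_minutes)
    # Minutes past the window accrue impact at the escalated rate.
    beyond = max(minutes_down - tolerance_minutes, 0.0)
    return within * base_cost_per_minute + beyond * escalated_cost_per_minute
```

Plotting this function for a few workloads is one way to see why different workloads justify different recovery objectives.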
- Cost constraints
- Workload dependencies
- Operational requirements
Best Practice: Use Defined Recovery Strategies
Use defined recovery strategies to meet recovery objectives:
- Define a DR strategy that meets the workload’s recovery objectives
- Choose a strategy such as backup and restore, pilot light, warm standby, or multi-site
- Choosing a DR strategy is a trade-off between reducing downtime and data loss (RTO and RPO) and the cost and complexity of implementation
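The trade-off above can be sketched as picking the cheapest strategy whose expected recovery time still fits the target. This is a minimal sketch, not AWS guidance: the `Strategy` type, the `cheapest_strategy` helper, and every recovery-time figure are hypothetical assumptions chosen for illustration. A real selection would also weigh RPO, workload dependencies, and operational requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_minutes: float  # assumed order-of-magnitude recovery time
    relative_cost: int          # 1 = cheapest, 4 = most expensive


# Ordered cheapest first. All recovery times are illustrative assumptions.
STRATEGIES = [
    Strategy("backup and restore", 24 * 60, 1),
    Strategy("pilot light", 60, 2),
    Strategy("warm standby", 10, 3),
    Strategy("multi-site", 1, 4),
]


def cheapest_strategy(target_rto_minutes: float) -> Strategy:
    """Return the lowest-cost strategy whose assumed RTO meets the target."""
    for strategy in STRATEGIES:
        if strategy.typical_rto_minutes <= target_rto_minutes:
            return strategy
    # No strategy is fast enough on paper; fall back to the fastest one.
    return STRATEGIES[-1]
```

For example, under these assumed figures `cheapest_strategy(120).name` evaluates to `"pilot light"`, because backup and restore is too slow and the remaining options cost more.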
Best Practice: Test Implementation
Test the disaster recovery implementation to validate it:
- Regularly test failover to recovery site to verify proper operation
- Verify that RTO and RPO are met
- The only error recovery that works is the path you test frequently
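Verifying that RTO and RPO are met comes down to comparing timestamps recorded during a failover test against the targets. The following is a minimal sketch under stated assumptions: the `RecoveryObjectives` and `DrillResult` types and the `evaluate_drill` helper are hypothetical names, and the timestamps would have to come from your own drill records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured in time


@dataclass(frozen=True)
class DrillResult:
    outage_start: datetime
    service_restored: datetime
    last_recoverable_data: datetime  # newest data present after the restore


def evaluate_drill(objectives: RecoveryObjectives, result: DrillResult) -> dict:
    """Compare measured downtime and data loss against the objectives."""
    downtime = result.service_restored - result.outage_start
    data_loss = result.outage_start - result.last_recoverable_data
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }
```

Recording these results for every test makes regressions visible, for example when a growing dataset pushes restore time past the RTO.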
Operational Excellence Pillar: Event Management
Manage Workload and Operations Events
The Operational Excellence pillar includes the ability to support development and run workloads effectively, gain insight into operations, and continuously improve supporting processes and procedures to deliver business value.
Best Practice: Customer Communication Plan
Define a customer communication plan for outages:
- Define and test communication plan for system outages
- Keep customers and stakeholders informed during outages
- Communicate directly with users when services are impacted and when services return to normal
Communication Plan Example
When a workload is impaired, Any Company Retail:
- Sends email notification to users describing impaired business functionality
- Provides realistic estimate of when service will be restored
- Maintains status page showing real-time information about workload health
- Tests communication plan in development environment twice per year to validate effectiveness
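The notification in this example carries three pieces of information: the impaired function, a restoration estimate, and a pointer to the status page. A minimal sketch of composing such a message follows. The `build_outage_notice` helper, the wording, and the example URL are hypothetical; actually sending the message is left to whatever channel your plan defines.

```python
from datetime import datetime, timezone


def build_outage_notice(affected_function: str,
                        estimated_restore: datetime,
                        status_page_url: str) -> str:
    """Compose a plain-text outage notice for affected users."""
    if estimated_restore.tzinfo is None:
        # A naive timestamp would be ambiguous for users in other time zones.
        raise ValueError("estimated_restore must be timezone-aware")
    eta = estimated_restore.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (
        f"We are currently experiencing an issue affecting {affected_function}. "
        f"We estimate that service will be restored by {eta}. "
        f"Live updates: {status_page_url}"
    )
```

Keeping the template in code makes it easy to exercise during the twice-yearly communication test.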
Your organization’s business continuity plan (BCP) includes the disaster recovery plan and the logistics of disaster recovery. Use various AWS services and resources to design for resiliency and recovery.
Security Pillar: Identity and Access Management
Permissions Management
Identity and access management must be considered when preparing for disasters.
Best Practice: Emergency Access Process
Establish an emergency access process:
- Create a process that provides emergency access to workloads in the unlikely event of an issue with the centralized identity provider
- Design processes for the different failure modes that might result in an emergency event
Emergency Access Scenario
Under normal circumstances, workforce users federate to the cloud through a centralized identity provider to manage workloads. However, if the centralized identity provider fails or the federation configuration in the cloud is modified, workforce users might not be able to federate into the cloud.
An emergency access process gives authorized administrators access to cloud resources through alternate means so that they can:
- Fix issues with the federation configuration
- Fix issues with workloads
This reduces the time users take to respond to and resolve emergency events, which results in less downtime and higher availability of the services provided to customers.
Well-documented and well-tested emergency access processes are essential for effective disaster recovery.
Practice and Testing Components
Regularly exercise the disaster recovery solution to ensure it works as intended through:
Practice Game Day Exercises
- Test scenarios when critical systems go offline
- Test entire Region failures
- Test response procedures for effectiveness
- Ensure teams are familiar with implementation procedures
Continuous Testing
- Conduct continuous testing on all components related to disaster recovery
- Verify backups, snapshots, and AMIs can successfully restore data
- Verify that the monitoring systems themselves are working
- Test communication plans and emergency access processes
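One simple way to check that a backup restores data correctly is to compare a checksum of the restored copy against the original. The sketch below uses only the Python standard library and assumes both copies are available as local files; the `sha256_of` and `restore_matches_source` helpers are hypothetical names, and this check complements rather than replaces an application-level restore test.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large backups do not exhaust memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """True when the restored file is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

Running a check like this on a schedule turns "backups exist" into "backups are known to restore".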
Key Integration Points
This module’s topics support Well-Architected best practices:
Elements Related to Avoiding and Planning for Disaster
- Understanding failure types and scales
- Implementing fault tolerance, backup, and disaster recovery strategies
- Considering factors influencing disaster planning strategies
RTO and RPO Implementation
- Defining recovery objectives based on business requirements
- Calculating acceptable data loss and downtime
- Aligning technical solutions with business needs
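For a backup-based approach, the data-loss calculation is simple arithmetic on the backup schedule. The sketch below assumes that each backup captures a point-in-time copy as of the moment it starts and becomes usable only once it completes, so the worst case is a failure just before the next backup finishes. Both helper names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on data loss for periodic point-in-time backups."""
    # The newest usable backup can be one interval plus one backup run old.
    return backup_interval + backup_duration


def max_backup_interval(rpo: timedelta,
                        backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Longest interval between backup starts that still satisfies the RPO."""
    interval = rpo - backup_duration
    if interval <= timedelta(0):
        raise ValueError("RPO cannot be met: a single backup takes too long")
    return interval
```

With a one-hour RPO and backups that take ten minutes, `max_backup_interval` returns fifty minutes under these assumptions.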
Common Disaster Recovery Patterns
- Backup and restore for lower-priority use cases
- Pilot light for core services requiring moderate recovery times
- Warm standby for business-critical services
- Multi-site for near real-time recovery requirements
AWS Services for Resiliency and Recovery
- Storage services: S3 Cross-Region Replication, EBS snapshots
- Compute services: AMIs, Auto Scaling
- Database services: RDS read replicas, DynamoDB global tables
- Networking services: Route 53, ELB
- Automation services: CloudFormation, OpsWorks
Applying AWS Well-Architected Framework principles to disaster planning helps ensure that recovery strategies align with business requirements while maintaining operational excellence, security, and reliability standards. The framework provides a structured approach to evaluating and improving disaster recovery capabilities.