Well-Architected Framework Disaster Planning

Applying AWS Well-Architected Framework Principles to Disaster Planning

Well-Architected Framework Overview

The AWS Well-Architected Framework has six pillars, each including best practices and questions to consider when architecting cloud solutions. This section highlights best practices from the pillars most relevant to disaster planning:

Reliability
Operational Excellence
Security

Reliability Pillar: Failure Management

Plan for Disaster Recovery

Building resilient workloads helps prepare for any event that prevents a workload from fulfilling business objectives in its primary location. Failure management is essential for implementing resiliency, with disaster recovery planning as a key component.

Foundation

Having backups and redundant workload components in place is the start of your disaster recovery (DR) strategy.

Best Practice: Define Recovery Objectives

Define recovery objectives for downtime and data loss:

Assign every workload a recovery time objective (RTO) and a recovery point objective (RPO) based on business impact
Understand the impact of downtime and lost data on the business
Impact generally grows with greater downtime or data loss, but the shape of that growth differs by workload type

Business impact manifests in multiple forms:

Monetary cost: Lost revenue
Customer trust: Impact to reputation
Operational issues: Missing payroll, decreased productivity
Regulatory risk: Compliance violations

Tolerance Example

You might tolerate downtime for up to an hour with little impact, but after that, impact quickly rises.

RTO and RPO are the primary considerations when selecting a disaster recovery strategy, along with:

Cost constraints
Workload dependencies
Operational requirements

Best Practice: Use Defined Recovery Strategies

Use defined recovery strategies to meet recovery objectives:

Define a DR strategy that meets the workload’s recovery objectives
Choose a strategy such as backup and restore, pilot light, warm standby, or multi-site
Choosing a DR strategy is a trade-off between reducing downtime and data loss (lower RTO and RPO) and the cost and complexity of implementation
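
The trade-off above can be sketched as a lookup that picks the least costly strategy whose typical recovery capability still satisfies the workload's objectives. This is a minimal Python sketch, not an AWS-defined algorithm: the Strategy class, the choose_strategy function, and every threshold in STRATEGIES are hypothetical placeholders that you would replace with figures from your own business impact analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rto_min: float  # assumed typical recovery time, in minutes
    achievable_rpo_min: float  # assumed typical data loss window, in minutes


# Ordered from lowest to highest cost and complexity.
# All numbers are illustrative assumptions, not AWS guarantees.
STRATEGIES = [
    Strategy("backup and restore", 24 * 60, 24 * 60),
    Strategy("pilot light", 60, 60),
    Strategy("warm standby", 10, 10),
    Strategy("multi-site", 1, 1),
]


def choose_strategy(required_rto_min: float, required_rpo_min: float) -> str:
    """Return the cheapest listed strategy that meets both recovery objectives."""
    for strategy in STRATEGIES:
        if (strategy.achievable_rto_min <= required_rto_min
                and strategy.achievable_rpo_min <= required_rpo_min):
            return strategy.name
    raise ValueError("No listed strategy meets these objectives")


if __name__ == "__main__":
    # A workload that tolerates 2 hours of downtime and 1 hour of data loss.
    print(choose_strategy(required_rto_min=120, required_rpo_min=60))  # pilot light
```

Ordering the list by cost means the first match is also the cheapest one, which mirrors the idea that tighter objectives should be paid for only when the business impact justifies them.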

Best Practice: Test Implementation

Test the disaster recovery implementation to validate it:

Regularly test failover to the recovery site to verify proper operation
Verify that RTO and RPO are met
The only error recovery that works is the path you test frequently
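
One way to make "verify that RTO and RPO are met" concrete is to record three timestamps during each failover test and compare the measured values with the targets. The sketch below assumes you can capture when the outage began, when service was restored, and the time of the last write that survived the failover; the evaluate_drill function and its parameter names are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(outage_start: datetime,
                   service_restored: datetime,
                   last_recoverable_write: datetime,
                   rto: timedelta,
                   rpo: timedelta) -> dict:
    """Compare measured recovery time and data loss from a drill with the targets."""
    recovery_time = service_restored - outage_start
    # Data written after the last recoverable write and before the outage is lost.
    data_loss_window = outage_start - last_recoverable_write
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


if __name__ == "__main__":
    result = evaluate_drill(
        outage_start=datetime(2024, 1, 1, 12, 0),
        service_restored=datetime(2024, 1, 1, 12, 45),
        last_recoverable_write=datetime(2024, 1, 1, 11, 50),
        rto=timedelta(hours=1),
        rpo=timedelta(minutes=5),
    )
    # Recovery took 45 minutes (within the RTO), but 10 minutes of data
    # were lost (outside the 5-minute RPO).
    print(result["rto_met"], result["rpo_met"])  # True False
```

Tracking these results across repeated tests shows whether the recovery path is still meeting its objectives as the workload changes.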

Operational Excellence Pillar: Event Management

Manage Workload and Operations Events

The Operational Excellence pillar includes the ability to support development and run workloads effectively, gain insight into operations, and continuously improve supporting processes and procedures to deliver business value.

Best Practice: Customer Communication Plan

Define a customer communication plan for outages:

Define and test a communication plan for system outages
Keep customers and stakeholders informed during outages
Communicate directly with users when services are impacted and when services return to normal

Communication Plan Example

When a workload is impaired, AnyCompany Retail:

Sends an email notification to users describing the impaired business functionality
Provides a realistic estimate of when service will be restored
Maintains a status page showing real-time information about workload health
Tests the communication plan in a development environment twice per year to validate its effectiveness

Your organization’s business continuity plan (BCP) includes the disaster recovery plan and the logistics of disaster recovery. Use AWS services and resources to design for resiliency and recovery.

Security Pillar: Identity and Access Management

Permissions Management

Account for identity and access management when preparing for disasters.

Best Practice: Emergency Access Process

Establish emergency access process:

Create a process that provides emergency access to workloads in the unlikely event of an issue with the centralized identity provider
Design processes for the different failure modes that might result in an emergency event

Emergency Access Scenario

Under normal circumstances, workforce users federate to the cloud through a centralized identity provider to manage workloads. However, if the centralized identity provider fails or the federation configuration in the cloud is modified, workforce users might not be able to federate into the cloud.

An emergency access process gives authorized administrators access to cloud resources through alternate means so that they can:

Fix issues with the federation configuration
Fix issues with workloads

This reduces the time users take to respond to and resolve emergency events, which results in less downtime and higher availability of the services provided to customers.

Well-documented and well-tested emergency access processes are essential for effective disaster recovery.

Practice and Testing Components

Exercise the disaster recovery solution regularly to ensure it works as intended through:

Practice Game Day Exercises

Test scenarios in which critical systems go offline
Test failures of an entire Region
Test response procedures for effectiveness
Ensure teams are familiar with implementation procedures

Continuous Testing

Conduct continuous testing on all components related to disaster recovery
Verify backups, snapshots, and AMIs can successfully restore data
Monitor the monitoring systems themselves so that a failed alarm does not go unnoticed
Test communication plans and emergency access processes
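
As a small illustration of verifying that a backup restores data correctly, the sketch below compares a SHA-256 checksum of an original file with a checksum of its restored copy. It assumes the restore test produces a file you can read locally; the function names sha256_of and restore_matches_source are hypothetical, and a real check for snapshots or AMIs would need service-specific steps that are not shown here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read 1 MiB at a time so large backups do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, restored: Path) -> bool:
    """True when the restored file is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

Running a check like this on a schedule, rather than only after an incident, is one way to confirm that a backup can actually be restored.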

Key Integration Points

This module’s topics support Well-Architected best practices:

Understanding failure types and scales
Implementing fault tolerance, backup, and disaster recovery strategies
Considering factors influencing disaster planning strategies

RTO and RPO Implementation

Defining recovery objectives based on business requirements
Calculating acceptable data loss and downtime
Aligning technical solutions with business needs
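
The calculations behind these points can be sketched with simple arithmetic. The example below assumes periodic, evenly spaced backups that complete instantly, so the worst-case data loss equals the gap between two backups, and it assumes recovery time is the sum of detection, restore, and validation phases. Both function names and the phase breakdown are hypothetical simplifications.

```python
import math
from datetime import timedelta


def backups_needed_per_day(rpo: timedelta) -> int:
    """Minimum evenly spaced backups per day so no gap exceeds the RPO."""
    if rpo <= timedelta(0):
        raise ValueError("RPO must be positive for periodic backups")
    return math.ceil(timedelta(days=1) / rpo)


def estimated_recovery_time(detect: timedelta,
                            restore: timedelta,
                            validate: timedelta) -> timedelta:
    """Planned recovery time as the sum of its phases, for comparison with the RTO."""
    return detect + restore + validate


if __name__ == "__main__":
    print(backups_needed_per_day(timedelta(hours=4)))  # 6
    planned = estimated_recovery_time(
        detect=timedelta(minutes=10),
        restore=timedelta(minutes=30),
        validate=timedelta(minutes=15),
    )
    print(planned <= timedelta(hours=1))  # True: 55 minutes fits a 1-hour RTO
```

If the planned figures do not fit the objectives, either the technical solution or the business targets need to change, which is the alignment step listed above.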

Common Disaster Recovery Patterns

Backup and restore for lower-priority use cases
Pilot light for core services requiring moderate recovery times
Warm standby for business-critical services
Multi-site for near real-time recovery requirements

AWS Services for Resiliency and Recovery

Storage services: S3 Cross-Region Replication, EBS snapshots
Compute services: AMIs, Auto Scaling
Database services: RDS read replicas, DynamoDB global tables
Networking services: Route 53, ELB
Automation services: CloudFormation, OpsWorks (retired by AWS in 2024)

Applying AWS Well-Architected Framework principles to disaster planning helps ensure that recovery strategies align with business requirements while maintaining operational excellence, security, and reliability standards. The framework provides a structured approach to evaluating and improving disaster recovery capabilities.