Disaster Planning Strategies

Failures Can Occur at Any Scale

Werner Vogels, VP and CTO at Amazon, famously stated, “Everything fails, all the time.” This principle shapes cloud architecture: designs should assume failure rather than treat it as an unlikely aberration.

Failures can be categorized into three types:

Small-scale events: A single server stops responding or goes offline
Large-scale events: Multiple resources are affected, perhaps across Availability Zones within an AWS Region
Global-scale events: Widespread failures that affect a large number of users and systems

Avoiding and Planning for Disaster

Organizations use three primary approaches to handle disasters:

Fault Tolerance

Built-in redundancy lets a system keep operating when one or more components fail
Examples include failures of hard disks, servers, or network connectivity
Production systems typically have defined uptime requirements

Backup

Critical to protecting data and ensuring business continuity
Essential challenge: the pace of data generation grows exponentially, while the density and durability of local disks improve more slowly
Must keep critical data backed up in case of disaster

Disaster Recovery

Preparing for and recovering from any event with negative impact on business continuity or finances
Disasters include hardware/software failure, network outages, power outages, physical building damage
Can be caused by human error or other significant events
Set of policies and procedures for recovering vital technology infrastructure and systems

Factors Influencing Disaster Planning Strategies

Four key factors businesses must consider:

Time Dependency

How quickly does this need to be remedied to avoid business impact?

Data Loss

What amount and type of data is acceptable to lose?

Geographic Location

Does this impact multiple AWS Regions? Do different Regions require different recovery measures?

Cost

Is the cost of the strategy proportionate to the business impact and risk it addresses?

Customers work with cloud architects to define acceptable data loss in disaster recovery scenarios. For example, consider whether losing older customer records would be catastrophic or tolerable, given how much revenue those customers currently contribute.

Recovery Point Objective (RPO)

RPO Definition

RPO is the maximum acceptable amount of data loss after an unplanned data-loss incident, expressed as an amount of time.

Example RPO Calculation

Determine acceptable loss: maximum 800 records
Calculate existing patterns: no more than 100 records created per hour
Calculate RPO: 8 hours (800 records ÷ 100 records per hour)
Result: If disaster occurs at 10 p.m., the system should recover all data created before 2 p.m.
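
The arithmetic in the example above can be checked with a short Python sketch. The function names (rpo_hours, recovery_point) and the calendar date are hypothetical, chosen only for illustration; the sketch assumes the 100 records per hour figure is a worst-case creation rate.

```python
from datetime import datetime, timedelta


def rpo_hours(max_records_lost: int, records_per_hour: int) -> float:
    """Return the RPO in hours: tolerable record loss divided by the
    worst-case rate at which records are created."""
    if records_per_hour <= 0:
        raise ValueError("records_per_hour must be positive")
    return max_records_lost / records_per_hour


def recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Return the point in time before which all data must be recoverable."""
    return disaster_time - rpo


# Values from the example: at most 800 records lost, at most 100 created per hour.
rpo = timedelta(hours=rpo_hours(800, 100))

# The date is arbitrary (an assumption); only the 10 p.m. time comes from the example.
disaster = datetime(2024, 1, 15, 22, 0)
point = recovery_point(disaster, rpo)

print(f"RPO: {rpo}")
print(f"Recover all data created before: {point:%I:%M %p}")
```

In practice, an RPO of 8 hours implies that backups or replication must run at least every 8 hours, since the most recent copy can never be older than the RPO when disaster strikes.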

Recovery Time Objective (RTO)

RTO Definition

RTO is the maximum acceptable amount of time after a disaster strikes that a business process can remain out of commission.

Companies decide on acceptable RPO and RTO based on financial impact when systems are unavailable, considering:

Loss of business due to downtime
Damage to reputation from lack of system availability

Example RTO Calculation

Ticketing service for local music venue
Business calculates that revenue loss begins after 2 hours of outage
Set RTO: up to 2 hours of downtime is acceptable
Result: If disaster occurs at 9 p.m., the system should be restored by 11 p.m.
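
The same kind of check works for RTO. The sketch below is a minimal illustration, not a monitoring tool: the function names (restore_deadline, meets_rto), the calendar date, and the two sample restore times are assumptions made for the example.

```python
from datetime import datetime, timedelta


def restore_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Return the latest time by which service must be restored."""
    return disaster_time + rto


def meets_rto(disaster_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """Return True if the service came back within the RTO window."""
    return restored_time <= restore_deadline(disaster_time, rto)


# Values from the example: a 2-hour RTO and a disaster at 9 p.m.
rto = timedelta(hours=2)
disaster = datetime(2024, 1, 15, 21, 0)  # arbitrary date (an assumption)

deadline = restore_deadline(disaster, rto)
print(f"Restore by: {deadline:%I:%M %p}")

# Two hypothetical recovery drills: one inside the window, one outside it.
print(meets_rto(disaster, datetime(2024, 1, 15, 22, 30), rto))
print(meets_rto(disaster, datetime(2024, 1, 15, 23, 30), rto))
```

Comparing measured restore times from recovery drills against the deadline is one way to confirm that a stated RTO is actually achievable.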

Business Continuity Plan (BCP)

A BCP is a system of prevention and recovery from potential threats to a company, consisting of:

Business impact analysis
Risk assessment
Disaster recovery plan
Defined RPO and RTO targets

Your disaster recovery plan should be a subset of your organization’s BCP. Aggressive disaster recovery targets are pointless if the workload’s business objectives still cannot be met because the disaster has affected parts of the business outside the workload.

Key Takeaways

Failures can occur at any time on small, large, or global scale
A disaster recovery plan helps limit business and customer impact when disasters occur
RPO is the maximum acceptable amount of data loss after an unplanned data-loss incident, expressed as an amount of time
RTO is the maximum amount of time an application, system, or process can be down without causing significant business damage
A BCP is a system of prevention and recovery from potential threats that includes RPO and RTO

Disaster planning strategies provide the foundation for understanding how organizations can prepare for and respond to various types of failures. The key is balancing recovery objectives with cost considerations while ensuring business continuity.