Pablo Rodriguez

Disaster Planning Strategies

Werner Vogels, VP and CTO at Amazon, famously stated “Everything fails, all the time.” This principle influences cloud computing architectural design because failure should be assumed, not treated as an unlikely aberration.

Failures can be categorized into three types:

  • Small-scale events: A single server stops responding or goes offline
  • Large-scale events: Multiple resources are affected, perhaps across Availability Zones within an AWS Region
  • Global-scale events: Widespread failure affecting a large number of users and systems

Organizations use three primary approaches to handle disasters:

High Availability

  • A system is highly available when it has enough redundancy to withstand the failure of individual or multiple components
  • Examples include hard disk, server, or network connectivity failures
  • Production systems typically have defined uptime requirements

Backup

  • Critical to protecting data and ensuring business continuity
  • Essential challenge: the pace of data generation grows exponentially while growth in local disk density and durability lags behind
  • Critical data must be kept backed up in case of disaster

Disaster Recovery

  • Preparing for and recovering from any event that has a negative impact on business continuity or finances
  • Disasters include hardware or software failure, network outages, power outages, and physical building damage
  • Can be caused by human error or other significant events
  • Relies on a set of policies and procedures for recovering vital technology infrastructure and systems

Factors Influencing Disaster Planning Strategies

Four key factors businesses must consider:

Time Dependency

How quickly does this need to be remedied to avoid business impact?

Data Loss

What amount and type of data is acceptable to lose?

Geographic Location

Does this impact multiple AWS Regions? Do different Regions require different recovery measures?

Cost

Does the cost represent the correct level considering business impact and risk?

Customers work with cloud architects to define acceptable data loss in disaster recovery scenarios. Consider whether losing older customer records would be catastrophic or tolerable based on their current revenue contribution.

RPO Definition

The recovery point objective (RPO) is the maximum acceptable amount of data loss after an unplanned data-loss incident, expressed as an amount of time.

Example:

  • Determine acceptable loss: a maximum of 800 records
  • Measure existing patterns: no more than 100 records are created per hour
  • Calculate RPO: 800 records ÷ 100 records per hour = 8 hours
  • Result: If a disaster occurs at 10 p.m., the system should recover all data from before 2 p.m.
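
The RPO arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of any AWS tooling: the function names `rpo_hours` and `recovery_point` are hypothetical, and the calendar date is arbitrary because the example only gives a time of day.

```python
from datetime import datetime, timedelta


def rpo_hours(max_records_lost: int, records_per_hour: int) -> float:
    """Convert a tolerable record loss into an RPO expressed in hours."""
    if records_per_hour <= 0:
        raise ValueError("records_per_hour must be positive")
    return max_records_lost / records_per_hour


def recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Return the time before which all data must be recoverable."""
    return disaster_time - rpo


# Figures from the example: 800 records tolerable, 100 records created per hour.
rpo = timedelta(hours=rpo_hours(800, 100))

# Disaster at 10 p.m. (the date itself is an arbitrary assumption).
disaster = datetime(2024, 1, 15, 22, 0)
point = recovery_point(disaster, rpo)

print(rpo, point.strftime("%H:%M"))  # prints "8:00:00 14:00"
```

Expressing the RPO as a `timedelta` keeps it in units of time, which matches how the objective is defined, and makes it easy to compare against a backup interval.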

RTO Definition

The recovery time objective (RTO) is the maximum acceptable amount of time after a disaster strikes that a business process can remain out of commission.

Companies decide on acceptable RPO and RTO based on financial impact when systems are unavailable, considering:

  • Loss of business due to downtime
  • Damage to reputation from lack of system availability

Example:

  • Scenario: a ticketing service for a local music venue
  • The business calculates that revenue loss begins after 2 hours of outage
  • Calculate RTO: 2 hours
  • Result: If a disaster occurs at 9 p.m., the system should be restored by 11 p.m.
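
The RTO example works the same way in the other direction: the objective is added to the time of the disaster to get a restore deadline. The sketch below assumes hypothetical helper names (`restore_deadline`, `meets_rto`) and an arbitrary calendar date.

```python
from datetime import datetime, timedelta


def restore_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Return the latest time by which service must be restored."""
    return disaster_time + rto


def meets_rto(disaster_time: datetime, restored_time: datetime,
              rto: timedelta) -> bool:
    """Check whether an actual restore time falls within the RTO."""
    return restored_time <= restore_deadline(disaster_time, rto)


rto = timedelta(hours=2)                    # revenue loss begins after 2 hours
outage_start = datetime(2024, 1, 15, 21, 0)  # 9 p.m.; the date is arbitrary

deadline = restore_deadline(outage_start, rto)
print(deadline.strftime("%H:%M"))  # prints "23:00"

# A restore at 10:45 p.m. is within the objective; one at 11:30 p.m. is not.
print(meets_rto(outage_start, datetime(2024, 1, 15, 22, 45), rto))  # True
print(meets_rto(outage_start, datetime(2024, 1, 15, 23, 30), rto))  # False
```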

A business continuity plan (BCP) is a system of prevention and recovery from potential threats to a company, consisting of:

  • Business impact analysis
  • Risk assessment
  • Disaster recovery plan
  • Evaluated and determined RPO and RTO

Your disaster recovery plan should be a subset of your organization’s BCP. Aggressive disaster recovery targets are pointless if the workload’s business objectives still cannot be met because the disaster has affected parts of the business outside the workload.
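
Because the BCP records the evaluated RPO and RTO, one way to use them is to compare the outcome of a recovery test against both targets. The following is a sketch under stated assumptions: the `RecoveryTargets` and `DrillResult` types, the `evaluate` function, and the sample drill figures are all hypothetical and only illustrate the comparison.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryTargets:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable downtime


@dataclass(frozen=True)
class DrillResult:
    data_loss_window: timedelta  # age of the newest recoverable data at failure
    downtime: timedelta          # time taken to restore service


def evaluate(targets: RecoveryTargets, result: DrillResult) -> List[str]:
    """Return the objectives that the drill failed to meet (empty if none)."""
    failures = []
    if result.data_loss_window > targets.rpo:
        failures.append("RPO exceeded")
    if result.downtime > targets.rto:
        failures.append("RTO exceeded")
    return failures


# Targets taken from the two examples in the text: RPO 8 hours, RTO 2 hours.
targets = RecoveryTargets(rpo=timedelta(hours=8), rto=timedelta(hours=2))

# Hypothetical drill: 6 hours of data lost, 3 hours to restore service.
drill = DrillResult(data_loss_window=timedelta(hours=6),
                    downtime=timedelta(hours=3))

print(evaluate(targets, drill))  # prints "['RTO exceeded']"
```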

  • Failures can occur at any time on small, large, or global scale
  • A disaster recovery plan helps limit business and customer impact when disasters occur
  • RPO is the maximum acceptable amount of data loss after an unplanned data-loss incident
  • RTO is the maximum amount of time an application, system, or process can be down without causing significant business damage
  • A BCP is a system of prevention and recovery from potential threats that includes RPO and RTO

Disaster planning strategies provide the foundation for understanding how organizations can prepare for and respond to various types of failures. The key is balancing recovery objectives with cost considerations while ensuring business continuity.