Werner Vogels, VP and CTO at Amazon, famously stated “Everything fails, all the time.” This principle shapes cloud architecture: a design should assume failure rather than treat it as an unlikely aberration.
Failures can be categorized into three types:
Small-scale events: A single server stops responding or goes offline
Large-scale events: Multiple resources are affected, perhaps across Availability Zones within an AWS Region
Global-scale events: Widespread failure affecting a large number of users and systems
How quickly does this need to be remedied to avoid business impact?
Data Loss
What amount and type of data is acceptable to lose?
Geographic Location
Does this impact multiple AWS Regions? Do different Regions require different recovery measures?
Cost
Is the cost of recovery proportionate to the business impact and risk?
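One way to reason about the cost question is to compare what each recovery option costs per year against the downtime loss it is expected to leave behind. The sketch below is a minimal illustration, not a prescribed method: the `DrOption` class, the option names, and every figure (incident frequency, loss per hour, annual costs) are hypothetical assumptions chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class DrOption:
    """A hypothetical recovery option; every figure here is illustrative."""
    name: str
    annual_cost: float            # yearly spend on this option
    expected_outage_hours: float  # downtime per incident with this option in place


def expected_annual_loss(option: DrOption, incidents_per_year: float,
                         loss_per_hour: float) -> float:
    """Estimated yearly business loss from downtime under this option."""
    return incidents_per_year * option.expected_outage_hours * loss_per_hour


def total_annual_cost(option: DrOption, incidents_per_year: float,
                      loss_per_hour: float) -> float:
    """Spend on the option plus the downtime loss it is expected to leave."""
    return option.annual_cost + expected_annual_loss(
        option, incidents_per_year, loss_per_hour)


# Assumed inputs: one incident every two years, 5,000 lost per hour of downtime.
INCIDENTS_PER_YEAR = 0.5
LOSS_PER_HOUR = 5_000.0

options = [
    DrOption("slower, cheaper", annual_cost=10_000, expected_outage_hours=24),
    DrOption("faster, costlier", annual_cost=60_000, expected_outage_hours=1),
]

# Pick the option with the lowest combined cost under these assumptions.
cheapest = min(
    options,
    key=lambda o: total_annual_cost(o, INCIDENTS_PER_YEAR, LOSS_PER_HOUR),
)

for o in options:
    total = total_annual_cost(o, INCIDENTS_PER_YEAR, LOSS_PER_HOUR)
    print(f"{o.name}: {total:,.0f} per year")
print(f"Lowest combined cost: {cheapest.name}")
```

With these assumed numbers the costlier option comes out ahead (62,500 versus 70,000 per year), but the result flips if incidents are rarer or downtime is cheaper, which is why the inputs should come from the business impact analysis rather than from guesswork.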
Customers work with cloud architects to define acceptable data loss in disaster recovery scenarios. Consider whether losing older customer records would be catastrophic or tolerable based on their current revenue contribution.
A business continuity plan (BCP) is a system of prevention and recovery from potential threats to a company. It consists of:
Business impact analysis
Risk assessment
Disaster recovery plan
A defined recovery point objective (RPO) and recovery time objective (RTO)
Your disaster recovery plan should be a subset of your organization’s BCP. Aggressive disaster recovery targets are pointless if the workload’s business objectives still cannot be met because the disaster has affected parts of the business outside the workload.
Failures can occur at any time, at a small, large, or global scale
A disaster recovery plan helps limit business and customer impact when disasters occur
RPO is the maximum acceptable amount of data loss after an unplanned data-loss incident, expressed as an amount of time
RTO is the maximum amount of time an application, system, or process can be down without causing significant business damage
A BCP is a system of prevention and recovery from potential threats that includes RPO and RTO
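The RPO and RTO definitions above can be checked against an incident timeline with simple date arithmetic. The following is a minimal sketch under stated assumptions: the function names, the timestamps, and the 4-hour RPO and 2-hour RTO targets are all hypothetical, and the data-loss window is approximated as the time between the last good backup and the incident.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Time span of data lost: everything written after the last good backup."""
    return incident - last_backup


def downtime(incident: datetime, restored: datetime) -> timedelta:
    """How long the workload was unavailable."""
    return restored - incident


def meets_objectives(last_backup: datetime, incident: datetime,
                     restored: datetime, rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare the actual loss window and downtime against the targets."""
    return {
        "rpo_met": data_loss_window(last_backup, incident) <= rpo,
        "rto_met": downtime(incident, restored) <= rto,
    }


# Hypothetical timeline: backup at 02:00, failure at 09:30, service back at 11:00.
result = meets_objectives(
    last_backup=datetime(2024, 1, 1, 2, 0),
    incident=datetime(2024, 1, 1, 9, 30),
    restored=datetime(2024, 1, 1, 11, 0),
    rpo=timedelta(hours=4),  # assumed target: lose at most 4 hours of data
    rto=timedelta(hours=2),  # assumed target: be back within 2 hours
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this example the service recovers within the RTO (1.5 hours of downtime), but 7.5 hours of data are lost, so the RPO is missed. Under these assumptions, meeting it would require more frequent backups rather than a faster restore.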
Disaster planning strategies provide the foundation for understanding how organizations can prepare for and respond to various types of failures. The key is balancing recovery objectives with cost considerations while ensuring business continuity.