The AWS Well-Architected Framework has six pillars, and each pillar includes best practices and a set of questions that you should consider when you architect cloud solutions. This section highlights a few best practices from the reliability and performance efficiency pillars that are most relevant to highly available systems.
High Availability Focus
A system is available when it can deliver its designed functionality at a given point in time. Highly available systems are those that can withstand some measure of degradation while remaining available.
To keep failures from affecting a whole system, use fault isolation boundaries. A fault isolation boundary limits the effect of a failure to a small set of components within the workload; components outside the boundary are unaffected.
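The idea of a fault isolation boundary can be illustrated with a cell-based partitioning sketch. This is a hypothetical example (the cell names and functions are invented for illustration, not an AWS API): customers are deterministically assigned to independent cells, so a failure in one cell affects only the customers inside that boundary.

```python
import hashlib

# Hypothetical sketch of a fault isolation boundary: customers are
# partitioned into independent "cells"; a failure in one cell affects
# only the customers assigned to it, never those outside the boundary.

CELLS = ["cell-a", "cell-b", "cell-c"]

def cell_for(customer_id: str) -> str:
    """Deterministically assign a customer to one isolation cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

def handle_request(customer_id: str, failed_cells: set) -> str:
    """Serve a request unless the customer's own cell has failed."""
    cell = cell_for(customer_id)
    if cell in failed_cells:
        return "unavailable"  # blast radius is limited to this one cell
    return f"served-by-{cell}"
```

With this assignment, taking one cell offline leaves every customer mapped to the other cells fully served, which is the limiting-the-blast-radius property the text describes.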
Deploy the workload to multiple locations:
Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions
These locations can be as diverse as required
One of the bedrock principles for service design in AWS is avoiding single points of failure in underlying physical infrastructure
This motivates users to build software and systems that use multiple Availability Zones and are resilient to the failure of a single zone
When building a system that relies on redundant components, it’s important to ensure that the components operate independently and, in the case of AWS Regions, autonomously
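The multi-location guidance above can be sketched as a simple placement function. This is a minimal, hypothetical illustration (the zone names are examples, and real placement is handled for you by services such as EC2 Auto Scaling): spreading instances evenly across Availability Zones bounds how much capacity the failure of any single zone can remove.

```python
from collections import defaultdict

# Hypothetical sketch: spread instances evenly across Availability Zones
# so that losing any single zone removes at most ceil(n / zones) instances.

def distribute(instance_ids: list, zones: list) -> dict:
    """Round-robin instances across zones; returns zone -> instance list."""
    placement = defaultdict(list)
    for i, instance in enumerate(instance_ids):
        placement[zones[i % len(zones)]].append(instance)
    return dict(placement)
```

For example, five instances across three zones land two-two-one, so a single zone failure costs at most two of five instances rather than all of them.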
Best Practice: Automate recovery for components constrained to a single location
If the best practice to deploy the workload to multiple locations is not possible due to technological constraints, you must implement an alternate path to resiliency:
Automate the ability to recreate necessary infrastructure, redeploy applications, and restore necessary data for these cases
Deploy your instances or containers by using automatic scaling when possible
If you cannot use automatic scaling, use automatic recovery for EC2 instances or implement self-healing automation based on Amazon EC2 or ECS container lifecycle events
Use Amazon EC2 Auto Scaling groups for instances and container workloads that have no requirement for a single instance IP address, private IP address, Elastic IP address, or instance metadata
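The self-healing automation described above can be sketched as an event handler that replaces an impaired instance. This is a hypothetical simulation (the event shape and `launch_instance` stand-in are invented for illustration); in practice EC2 Auto Scaling or EC2 automatic recovery performs the replacement for you in response to real lifecycle and health events.

```python
import itertools

# Hypothetical sketch of self-healing automation: react to a (simulated)
# instance lifecycle event by replacing the impaired instance, keeping the
# fleet at its intended size.

_ids = itertools.count(1)

def launch_instance() -> str:
    """Stand-in for an API call that launches a replacement instance."""
    return f"i-replacement-{next(_ids)}"

def on_lifecycle_event(event: dict, fleet: set) -> set:
    """Replace an instance when its health status reports 'impaired'."""
    if event.get("state") == "impaired" and event.get("instance_id") in fleet:
        fleet = (fleet - {event["instance_id"]}) | {launch_instance()}
    return fleet
```

After handling an impairment event, the fleet size is unchanged: the failed instance is gone and a fresh one has taken its place, which is the recovery outcome the best practice calls for.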
Failure Management - Design Your Workload to Withstand Component Failures
It is a best practice to have a backup plan if resources are impaired. When designing a service, always distribute the load across resources, Availability Zones, or Regions.
Fail over to healthy resources:
If a resource failure occurs, healthy resources should continue to serve requests
For location impairments (such as Availability Zone or AWS Region), ensure that you have systems in place to fail over to healthy resources in unimpaired locations
Shifting traffic to the remaining healthy resources can mitigate the failure of an individual resource or impairment
Consider how services are discovered and routed to in the event of a failure
Many AWS managed services, such as Lambda, Aurora, and DynamoDB, are deployed across multiple Availability Zones
These services automatically handle failover for you
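The fail-over-to-healthy-resources behavior described above can be sketched as a small router that only sends requests to endpoints whose health checks pass. This is a minimal, hypothetical illustration (the endpoint names and routing scheme are invented); real deployments typically get this from Elastic Load Balancing or Route 53 health checks.

```python
# Hypothetical sketch of failing over to healthy resources: a router that
# distributes requests only across endpoints in unimpaired locations.

def healthy_endpoints(endpoints: dict) -> list:
    """Return the endpoints whose health check currently passes."""
    return sorted(name for name, healthy in endpoints.items() if healthy)

def route(request_id: int, endpoints: dict) -> str:
    """Round-robin a request across the currently healthy endpoints."""
    targets = healthy_endpoints(endpoints)
    if not targets:
        raise RuntimeError("no healthy endpoints available")
    return targets[request_id % len(targets)]
```

When one endpoint becomes unhealthy, its traffic automatically shifts to the remaining healthy ones, which is exactly the mitigation the text describes for resource or location impairment.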
Custom Architectures
Other AWS services, such as Amazon EC2 and Amazon EKS, require specific best practice designs to support the failover of resources or data storage across Availability Zones
You must implement specific patterns for high availability
Monitoring Requirements:
Monitoring should be set up to check that the failover resource is healthy
Track the progress of the resources failing over
Monitor business process recovery
The desired outcome is that systems are capable of automatically or manually using new resources to recover from degradation
Best Practice: Send notifications when events impact availability
Notifications are sent when a threshold breach is detected, even if the event that caused the issue was automatically resolved
Automated healing helps make your workload reliable, but it can also obscure underlying problems that need to be addressed
Implement appropriate monitoring and events so that you can detect patterns of problems, including those addressed by auto healing, so that you can resolve the root cause of issues
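The pattern-detection advice above can be sketched with a small monitor that records every threshold breach, including ones auto-healing resolves, and flags components that breach repeatedly. This is a hypothetical illustration (the class, threshold, and message format are invented); in practice CloudWatch alarms and SNS notifications fill this role.

```python
from collections import Counter

# Hypothetical sketch: record every threshold breach, even those that
# auto-healing resolves, so recurring problems surface for root-cause
# analysis instead of staying hidden behind automated recovery.

class AvailabilityMonitor:
    def __init__(self, repeat_threshold: int = 3):
        self.breaches = Counter()
        self.repeat_threshold = repeat_threshold
        self.notifications = []

    def record_breach(self, component: str, auto_healed: bool) -> None:
        """Always notify, whether or not automation resolved the event."""
        self.breaches[component] += 1
        status = "auto-healed" if auto_healed else "unresolved"
        self.notifications.append(f"{component}: threshold breached ({status})")

    def recurring_problems(self) -> list:
        """Components breaching often enough to warrant root-cause work."""
        return sorted(c for c, n in self.breaches.items()
                      if n >= self.repeat_threshold)
```

A component that is repeatedly impaired but always auto-healed still shows up in `recurring_problems()`, which is the point: automation keeps the workload available while the notifications preserve the evidence needed to fix the root cause.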
Automation is key to provisioning enough resources for your workload. The optimal compute choice for a particular workload can vary based on application design, usage patterns, and configuration settings. An architecture might use different compute choices for different components and enable different features to improve performance.
Scale your compute resources dynamically:
AWS provides the flexibility to scale your resources up or down dynamically through a variety of scaling mechanisms in order to meet changes in demand. Combined with compute-related metrics, dynamic scaling allows a workload to automatically respond to changes and use the optimal set of resources to achieve its goal.
You must ensure that workload deployments can handle both scale-up and scale-down events. You can use a number of different approaches to match the supply of resources with demand:
Target-tracking Approach
Monitor your scaling metric, and automatically increase or decrease capacity as you need it.
Predictive Scaling
Scale in anticipation of daily and weekly trends.
Schedule-based Approach
Set your own scaling schedule according to predictable load changes.
Service Scaling
Choose services such as serverless services that automatically scale by design.
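The target-tracking approach from the list above can be sketched as a proportional capacity calculation: if the observed per-instance metric is above its target, add capacity in proportion; if below, remove it. This is a simplified, hypothetical model of the idea (the function and its clamping bounds are invented for illustration, not the exact algorithm EC2 Auto Scaling uses).

```python
import math

# Hypothetical sketch of the target-tracking idea: adjust capacity so the
# observed per-instance metric (e.g. average CPU) converges on a target,
# clamped to configured minimum and maximum capacity.

def desired_capacity(current_capacity: int, metric_value: float,
                     target_value: float, min_cap: int, max_cap: int) -> int:
    """Scale proportionally to how far the metric is from its target."""
    proposed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_cap, min(max_cap, proposed))
```

For example, 10 instances running at 75% average CPU against a 50% target yields a desired capacity of 15, while the same fleet at 25% average CPU scales in to 5; the min/max clamp keeps scale-up and scale-down events within the configured bounds, matching the requirement that deployments handle both directions.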
The AWS Well-Architected Framework principles for high availability include:
Deploy the workload to multiple locations - distribute across Availability Zones and Regions to avoid single points of failure
Automate recovery for components constrained to a single location - implement self-healing mechanisms when multi-location deployment isn’t possible
Fail over to healthy resources - ensure systems can automatically shift traffic to healthy components during failures
Send notifications when events impact availability - implement monitoring and alerting to detect and respond to issues quickly
Scale your compute resources dynamically - use automation to match resource supply with demand through various scaling approaches
These principles guide the design of resilient, highly available systems that can withstand component failures and continue to deliver functionality even under adverse conditions.