The AWS Well-Architected Framework has six pillars, and each pillar includes best practices and a set of questions that you should consider when you architect cloud solutions. This section highlights a few best practices from the reliability and performance efficiency pillars that are most relevant to highly available systems.
High Availability Focus
A system is available when it can deliver its designed functionality at a given point in time. Highly available systems are those that can withstand some measure of degradation while remaining available.
To keep failures from affecting a whole system, use fault isolation boundaries. A fault isolation boundary limits the effect of a failure to a small set of components within the workload; components outside the boundary are unaffected.
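The idea of a fault isolation boundary can be illustrated with a cell-based partitioning sketch. This is a hypothetical example (the cell names and functions are invented for illustration, not an AWS API): customers are deterministically assigned to independent cells, so a failure in one cell affects only the customers inside that boundary.

```python
import hashlib

# Hypothetical sketch of a fault isolation boundary: customers are
# partitioned into independent "cells"; a failure in one cell affects
# only the customers assigned to it, never those outside the boundary.

CELLS = ["cell-a", "cell-b", "cell-c"]

def cell_for(customer_id: str) -> str:
    """Deterministically assign a customer to one isolation cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

def handle_request(customer_id: str, failed_cells: set) -> str:
    """Serve a request unless the customer's own cell has failed."""
    cell = cell_for(customer_id)
    if cell in failed_cells:
        return "unavailable"  # blast radius is limited to this one cell
    return f"served-by-{cell}"
```

With this assignment, taking one cell offline leaves every customer mapped to the other cells fully served, which is the limiting-the-blast-radius property the text describes.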
Deploy the workload to multiple locations:
Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions
These locations can be as diverse as required
One of the bedrock principles for service design in AWS is avoiding single points of failure in underlying physical infrastructure
This motivates users to build software and systems that use multiple Availability Zones and are resilient to the failure of a single zone
When building a system that relies on redundant components, it’s important to ensure that the components operate independently and, in the case of AWS Regions, autonomously
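The multi-location guidance above can be sketched as a simple placement function. This is a minimal, hypothetical illustration (the zone names are examples, and real placement is handled for you by services such as EC2 Auto Scaling): spreading instances evenly across Availability Zones bounds how much capacity the failure of any single zone can remove.

```python
from collections import defaultdict

# Hypothetical sketch: spread instances evenly across Availability Zones
# so that losing any single zone removes at most ceil(n / zones) instances.

def distribute(instance_ids: list, zones: list) -> dict:
    """Round-robin instances across zones; returns zone -> instance list."""
    placement = defaultdict(list)
    for i, instance in enumerate(instance_ids):
        placement[zones[i % len(zones)]].append(instance)
    return dict(placement)
```

For example, five instances across three zones land two-two-one, so a single zone failure costs at most two of five instances rather than all of them.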
Best Practice: Automate recovery for components constrained to a single location
If the best practice to deploy the workload to multiple locations is not possible due to technological constraints, you must implement an alternate path to resiliency:
Automate the ability to recreate necessary infrastructure, redeploy applications, and restore necessary data for these cases
Deploy your instances or containers by using automatic scaling when possible
If you cannot use automatic scaling, use automatic recovery for EC2 instances or implement self-healing automation based on Amazon EC2 or ECS container lifecycle events
Use Amazon EC2 Auto Scaling groups for instances and container workloads that have no requirement for a single instance IP address, private IP address, Elastic IP address, or instance metadata
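The self-healing automation described above can be sketched as an event handler that replaces an impaired instance. This is a hypothetical simulation (the event shape and `launch_instance` stand-in are invented for illustration); in practice EC2 Auto Scaling or EC2 automatic recovery performs the replacement for you in response to real lifecycle and health events.

```python
import itertools

# Hypothetical sketch of self-healing automation: react to a (simulated)
# instance lifecycle event by replacing the impaired instance, keeping the
# fleet at its intended size.

_ids = itertools.count(1)

def launch_instance() -> str:
    """Stand-in for an API call that launches a replacement instance."""
    return f"i-replacement-{next(_ids)}"

def on_lifecycle_event(event: dict, fleet: set) -> set:
    """Replace an instance when its health status reports 'impaired'."""
    if event.get("state") == "impaired" and event.get("instance_id") in fleet:
        fleet = (fleet - {event["instance_id"]}) | {launch_instance()}
    return fleet
```

After handling an impairment event, the fleet size is unchanged: the failed instance is gone and a fresh one has taken its place, which is the recovery outcome the best practice calls for.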
Failure Management - Design Your Workload to Withstand Component Failures
It is a best practice to have a backup plan if resources are impaired. When designing a service, always distribute the load across resources, Availability Zones, or Regions.
Fail over to healthy resources:
If a resource failure occurs, healthy resources should continue to serve requests
For location impairments (such as Availability Zone or AWS Region), ensure that you have systems in place to fail over to healthy resources in unimpaired locations
Shifting traffic to the remaining healthy resources can mitigate the failure of an individual resource or impairment
Consider how services are discovered and routed to in the event of a failure
Many AWS managed services, such as Lambda, Aurora, and DynamoDB, are deployed across multiple Availability Zones
These services automatically handle failover for you
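The fail-over-to-healthy-resources behavior described above can be sketched as a small router that only sends requests to endpoints whose health checks pass. This is a minimal, hypothetical illustration (the endpoint names and routing scheme are invented); real deployments typically get this from Elastic Load Balancing or Route 53 health checks.

```python
# Hypothetical sketch of failing over to healthy resources: a router that
# distributes requests only across endpoints in unimpaired locations.

def healthy_endpoints(endpoints: dict) -> list:
    """Return the endpoints whose health check currently passes."""
    return sorted(name for name, healthy in endpoints.items() if healthy)

def route(request_id: int, endpoints: dict) -> str:
    """Round-robin a request across the currently healthy endpoints."""
    targets = healthy_endpoints(endpoints)
    if not targets:
        raise RuntimeError("no healthy endpoints available")
    return targets[request_id % len(targets)]
```

When one endpoint becomes unhealthy, its traffic automatically shifts to the remaining healthy ones, which is exactly the mitigation the text describes for resource or location impairment.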
Custom Architectures
Other AWS services, such as Amazon EC2 and Amazon EKS, require specific best practice designs to support the failover of resources or data storage across Availability Zones
You must implement specific patterns for high availability
Monitoring Requirements:
Monitoring should be set up to check that the failover resource is healthy
Track the progress of the resources failing over
Monitor business process recovery
The desired outcome is that systems are capable of automatically or manually using new resources to recover from degradation
Best Practice: Send notifications when events impact availability
Notifications are sent when a threshold breach is detected, even if the event that caused the issue was automatically resolved
Automated healing helps make your workload reliable, but it can also obscure underlying problems that need to be addressed
Implement appropriate monitoring and events so that you can detect patterns of problems, including those addressed by auto healing, so that you can resolve the root cause of issues
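The pattern-detection advice above can be sketched with a small monitor that records every threshold breach, including ones auto-healing resolves, and flags components that breach repeatedly. This is a hypothetical illustration (the class, threshold, and message format are invented); in practice CloudWatch alarms and SNS notifications fill this role.

```python
from collections import Counter

# Hypothetical sketch: record every threshold breach, even those that
# auto-healing resolves, so recurring problems surface for root-cause
# analysis instead of staying hidden behind automated recovery.

class AvailabilityMonitor:
    def __init__(self, repeat_threshold: int = 3):
        self.breaches = Counter()
        self.repeat_threshold = repeat_threshold
        self.notifications = []

    def record_breach(self, component: str, auto_healed: bool) -> None:
        """Always notify, whether or not automation resolved the event."""
        self.breaches[component] += 1
        status = "auto-healed" if auto_healed else "unresolved"
        self.notifications.append(f"{component}: threshold breached ({status})")

    def recurring_problems(self) -> list:
        """Components breaching often enough to warrant root-cause work."""
        return sorted(c for c, n in self.breaches.items()
                      if n >= self.repeat_threshold)
```

A component that is repeatedly impaired but always auto-healed still shows up in `recurring_problems()`, which is the point: automation keeps the workload available while the notifications preserve the evidence needed to fix the root cause.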
Automation is key to provisioning enough resources for your workload. The optimal compute choice for a particular workload can vary based on application design, usage patterns, and configuration settings. An architecture might use different compute choices for different components and enable different features to improve performance.
Scale your compute resources dynamically:
AWS provides the flexibility to scale your resources up or down dynamically through a variety of scaling mechanisms in order to meet changes in demand. Combined with compute-related metrics, dynamic scaling allows a workload to automatically respond to changes and use the optimal set of resources to achieve its goal.
You must ensure that workload deployments can handle both scale-up and scale-down events. You can use a number of different approaches to match the supply of resources with demand:
Target-tracking Approach
Monitor your scaling metric, and automatically increase or decrease capacity as you need it.
Predictive Scaling
Scale in anticipation of daily and weekly trends.
Schedule-based Approach
Set your own scaling schedule according to predictable load changes.
Service Scaling
Choose services such as serverless services that automatically scale by design.
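The target-tracking approach from the list above can be sketched as a proportional capacity calculation: if the observed per-instance metric is above its target, add capacity in proportion; if below, remove it. This is a simplified, hypothetical model of the idea (the function and its clamping bounds are invented for illustration, not the exact algorithm EC2 Auto Scaling uses).

```python
import math

# Hypothetical sketch of the target-tracking idea: adjust capacity so the
# observed per-instance metric (e.g. average CPU) converges on a target,
# clamped to configured minimum and maximum capacity.

def desired_capacity(current_capacity: int, metric_value: float,
                     target_value: float, min_cap: int, max_cap: int) -> int:
    """Scale proportionally to how far the metric is from its target."""
    proposed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_cap, min(max_cap, proposed))
```

For example, 10 instances running at 75% average CPU against a 50% target yields a desired capacity of 15, while the same fleet at 25% average CPU scales in to 5; the min/max clamp keeps scale-up and scale-down events within the configured bounds, matching the requirement that deployments handle both directions.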
The AWS Well-Architected Framework principles for high availability include:
Deploy the workload to multiple locations - distribute across Availability Zones and Regions to avoid single points of failure
Automate recovery for components constrained to a single location - implement self-healing mechanisms when multi-location deployment isn’t possible
Fail over to healthy resources - ensure systems can automatically shift traffic to healthy components during failures
Send notifications when events impact availability - implement monitoring and alerting to detect and respond to issues quickly
Scale your compute resources dynamically - use automation to match resource supply with demand through various scaling approaches
These principles guide the design of resilient, highly available systems that can withstand component failures and continue to deliver functionality even under adverse conditions.