August 4, 2023

Architecting Cloud Resilience: Options and Trade-offs for AWS

Tal Shladovsky

Cloud Specialist

AWS

Resilience

Disaster Recovery

Overview

As businesses increasingly migrate their applications and infrastructure to the cloud, ensuring resiliency becomes paramount to maintain service availability and minimize disruptions. Resilient cloud architectures are designed to withstand failures and recover quickly from disruptions, offering enhanced reliability and performance.
This article walks through five resilience patterns and the trade-offs to consider when designing your workloads, to help architects build efficient and robust cloud solutions.
To meet your business resilience requirements, weigh each pattern against the following key factors:

Design Complexity - As system complexity rises, the resulting emergent behaviors of the system tend to increase as well. It becomes necessary to ensure resilience in each workload component and eliminate single points of failure across people, processes, and technology elements. Customers must evaluate their resilience needs and determine whether increasing system complexity is the most effective approach or if maintaining simplicity and relying on a disaster recovery (DR) plan would be more suitable.

Cost to Implement - Implementing higher resilience can lead to significant cost increases as it involves the addition of new software and infrastructure components. However, it is crucial to balance these costs against the potential expenses of future losses.

Operational Effort - Deploying and maintaining highly resilient systems requires complex operational processes and advanced technical skills. Before you decide to implement higher resilience, evaluate your operational capabilities to confirm that you have the necessary process maturity and skill sets.

Effort to Secure - While security complexity does not have a direct correlation with resilience, highly resilient systems often involve securing a greater number of components. Nevertheless, by implementing security best practices for cloud deployments, it is possible to achieve security objectives without introducing significant complexity, even with a larger deployment footprint.

Environmental Impact - Expanding the deployment footprint of resilient systems can lead to higher consumption of cloud resources. However, it is possible to manage resource consumption through trade-offs, such as using approximate computing and intentionally accepting slower response times. The AWS Well-Architected Sustainability Pillar offers insights into these patterns and guidance on adopting sustainable best practices.

By weighing these factors, you can compare the different levels of resiliency and determine the architecture that best aligns with your requirements.

Resilience patterns and trade-offs. **Source:** AWS

Note: These patterns are not mutually exclusive; you can combine two or more of them.

What is resiliency?

Resiliency, as defined by the AWS Well-Architected Framework, refers to the ability of a system to recover from failures and continue operating as expected.
A resilient workload can recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and the failure of any component in the workload.

Pattern 1 (P1): Multi-AZ

P1 is an architectural pattern that enhances resilience by using multiple Availability Zones (AZs) within a single AWS Region, so that your application can withstand disruptions at the AZ level.

How does P1 work?

The application runs on a single EC2 instance managed by an Auto Scaling group that uses health checks to replace unhealthy instances. If an AZ becomes unavailable, Auto Scaling detects the unhealthy instance and replaces it with a new instance in another available AZ.
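
As a minimal sketch of this setup, the snippet below builds the request parameters for a single-instance Auto Scaling group that spans several AZs. It assumes the group would be created with boto3's `create_auto_scaling_group` call; the group name, launch template name, and subnet IDs are hypothetical placeholders, and each subnet is assumed to sit in a different AZ. The function only builds a dictionary, so it runs without AWS credentials.

```python
def build_p1_asg_request(group_name, launch_template_name, subnet_ids):
    """Build parameters for a single-instance Auto Scaling group spanning several AZs.

    Assumption: every subnet ID belongs to a different Availability Zone.
    """
    if len(subnet_ids) < 2:
        raise ValueError("P1 needs subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # One instance only: recovery depends on launching a replacement.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Comma-separated subnet IDs let the group place the instance in any listed AZ.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,
    }


# Hypothetical names, for illustration only.
request = build_p1_asg_request(
    "internal-app-asg", "internal-app-template", ["subnet-aaa", "subnet-bbb"]
)
# With credentials configured, the request could be sent like this:
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```

Keeping the minimum, maximum, and desired capacity at 1 is what makes this P1 rather than P2: nothing is pre-provisioned in the second AZ, so users wait while the replacement instance launches.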

Who is P1 suitable for?

Low-business-impact applications with lower resiliency requirements, such as internal employee applications.

P1 Trade-offs

P1 has a low impact on all factors. It softens the effect of an availability disruption, but at the expense of recovery time: in the event of an AZ failure or outage, end users' access to the application is disrupted until new resources are provisioned in a different AZ. This is known as bimodal behavior.

Pattern 2 (P2): Multi-AZ with static stability

P2 uses multiple EC2 instances across multiple Availability Zones (AZs) within a Region to increase resilience, while using static stability to prevent bimodal behavior.
Statically stable workloads operate in one mode regardless of changes in the operating environment.
In practice, you pre-provision enough instances in each AZ to handle the full workload if one AZ fails, and then use Elastic Load Balancing or Route 53 health checks to shift load away from the unavailable instances to the available ones in the other AZs.
One advantage of a static stability approach is that the pre-provisioned capacity simplifies recovery during a disruption.

How does P2 work?

The application runs on multiple instances managed by an Auto Scaling group across multiple AZs. When one AZ fails, the application continues operating because the Elastic Load Balancer shifts end users' traffic to the working AZs.

Multi-AZ with static stability pattern (P2). **Source:** AWS

Who is P2 suitable for?

Customer-facing websites that have a lower tolerance for downtime.

P2 Trade-offs

Adopting P2 results in higher reliability than P1, because end users do not experience application downtime. However, P1 has lower infrastructure costs, because you provision less compute capacity and rely on launching new instances when a failure occurs.
For large-scale failures (such as an Availability Zone failure), the P1 approach is less effective because it relies on reacting to impairments as they happen, rather than being prepared for them in advance.
Therefore, to determine the most suitable solution for your workload, you need to balance reliability and cost requirements. If your application can support the P2 approach, increasing the number of AZs you use within a Region can reduce the additional compute cost, as you provision less spare capacity.
For example, if you use two AZs, you should provision enough EC2 instances such that the unaffected AZ can handle 100% of the workload.
If you use three AZs, you should provision enough EC2 instances such that the two unaffected AZs can handle 100% of the workload.
This means that you only have to provision 150% of your capacity across three AZs, compared with 200% across two AZs, which reduces your costs.
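
The arithmetic behind these percentages can be sketched as follows. The function name and the example figure of 12 required instances are illustrative assumptions; the rule it encodes is the one described above, namely that the AZs remaining after a single AZ failure must carry 100% of the workload.

```python
import math


def static_stability_plan(required_instances, az_count):
    """Return (instances per AZ, total instances, total as % of required capacity).

    Sizing rule: after losing any one AZ, the remaining AZs must still
    provide `required_instances`.
    """
    if az_count < 2:
        raise ValueError("static stability needs at least two AZs")
    per_az = math.ceil(required_instances / (az_count - 1))
    total = per_az * az_count
    percent_of_required = 100 * total / required_instances
    return per_az, total, percent_of_required


for az_count in (2, 3):
    per_az, total, pct = static_stability_plan(12, az_count)
    print(f"{az_count} AZs: {per_az} per AZ, {total} total, {pct:.0f}% of required capacity")
# 2 AZs: 12 per AZ, 24 total, 200% of required capacity
# 3 AZs: 6 per AZ, 18 total, 150% of required capacity
```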

Static stability of EC2 instances across Availability Zones. **Source:** AWS

Pattern 3 (P3): Application portfolio distribution

P3 uses a multi-Region pattern to increase functional resilience by distributing different critical applications across multiple Regions.
Regional service disruptions are rare, but implementing this pattern helps ensure that your end users retain access to business-critical services when one occurs.

How does P3 work?

Consider an organization that provides its services through multiple digital channels (e.g., an online website and a mobile application). Each digital channel or service is deployed in a different Region.
If digital channel #1 (e.g., the mobile application) is disrupted, end users can still consume the service through an alternate digital channel (channel #2, the online website).
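
The idea can be illustrated with a small model. The channel names and the channel-to-Region assignment below are hypothetical; the sketch simply reports which channels remain reachable when a given set of Regions is disrupted.

```python
def available_channels(channel_regions, failed_regions):
    """Return the channels whose hosting Region is not in `failed_regions`."""
    return sorted(
        channel
        for channel, region in channel_regions.items()
        if region not in failed_regions
    )


# Hypothetical portfolio: each digital channel is deployed in its own Region.
portfolio = {
    "mobile-application": "us-east-1",
    "online-website": "eu-west-1",
}

print(available_channels(portfolio, failed_regions={"us-east-1"}))
# ['online-website']
```

If both channels depended on a shared data source in a single Region, this simple model would overstate availability, which is the caveat raised in the trade-offs for this pattern.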

Application portfolio distribution pattern (P3). **Source:** AWS

Who is P3 suitable for?

Business-critical services that are distributed via multiple digital channels.

P3 Trade-offs

P3 addresses the risk of a regional service disruption affecting multiple applications simultaneously.
Running an application portfolio across multiple Regions requires extensive operational planning and management. Isolated functional elements might still rely on shared downstream systems and data sources deployed in a single Region, so some disruptions may still occur; even so, a Region-wide event should have a smaller impact surface area.

Pattern 4 (P4): Multi-AZ deployment (multi-Region active/passive DR)

P4 uses multi-AZ deployments in multiple Regions, together with active/passive strategies that enable the workload to recover from disaster events.
When selecting your disaster recovery (DR) strategy, you must weigh the benefits of lower RTO (recovery time objective) and RPO (recovery point objective) vs the costs of implementing and operating a strategy.
The Pilot Light and Warm Standby strategies offer a good balance of benefits and cost.

How does P4 work?

The Pilot Light pattern is suitable for applications with recovery time objectives (RTO) and recovery point objectives (RPO) in the range of tens of minutes. In this pattern, data is continuously replicated, and the application infrastructure is pre-provisioned in the disaster recovery (DR) Region. The main focus is cost optimization, as the application infrastructure remains switched off and is only activated during the restoration event.

The Warm Standby pattern offers a notable improvement in restore times compared to the Pilot Light approach by maintaining application availability in the disaster recovery (DR) Region, albeit at a reduced capacity. During a DR event, the application infrastructure can be automatically scaled up with minimal manual intervention. When implemented correctly, this pattern can achieve recovery time objectives (RTO) and recovery point objectives (RPO) within minutes.

Both the Pilot Light and Warm Standby strategies replicate data from the primary Region to data resources in the recovery Region, and those data resources are ready to serve requests. In addition to replication, both strategies require continuous backups in the recovery Region: in a disaster caused by human action, data can be deleted or corrupted, and replication will copy the bad data. Backups let you return to the last known good state.
Resources used for the workload infrastructure are deployed in the recovery region for both strategies and will require additional actions to become production ready.
As required for all active/passive strategies, both require a means to route traffic to the primary Region, and then fail over to the recovery Region when recovering from a disaster.
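
One common way to implement this routing is a pair of Route 53 failover records. The sketch below builds a change batch in the shape accepted by boto3's `change_resource_record_sets` call; it is an illustration under stated assumptions, not the only way to route traffic. The domain name, load balancer DNS names, and hosted zone IDs are hypothetical placeholders.

```python
def failover_record(name, role, target_dns_name, target_hosted_zone_id):
    """Build one alias record for Route 53 failover routing."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            # Failover records with the same name need distinct identifiers.
            "SetIdentifier": f"{name}-{role.lower()}",
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": target_hosted_zone_id,
                "DNSName": target_dns_name,
                # Let Route 53 use the target's health to decide on failover.
                "EvaluateTargetHealth": True,
            },
        },
    }


def build_failover_change_batch(name, primary, recovery):
    """`primary` and `recovery` are (dns_name, hosted_zone_id) tuples."""
    return {
        "Comment": "Active/passive routing between primary and recovery Regions",
        "Changes": [
            failover_record(name, "PRIMARY", *primary),
            failover_record(name, "SECONDARY", *recovery),
        ],
    }


# Hypothetical endpoints, for illustration only.
batch = build_failover_change_batch(
    "app.example.com.",
    primary=("primary-alb.example.com.", "ZPRIMARYEXAMPLE"),
    recovery=("recovery-alb.example.com.", "ZRECOVERYEXAMPLE"),
)
# With credentials configured, the batch could be applied like this:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```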

Who is P4 suitable for?

Business-critical services that have a very low tolerance for disruption.

P4 Trade-offs

P4 addresses regional service disruptions while simultaneously reducing mitigation costs. Regional disaster recovery (DR) patterns introduce increased complexity due to the synchronization of infrastructure changes across multiple Regions. Testing resilience also becomes significantly more intricate, including the simulation of regional disruptions. However, employing Infrastructure as Code for automated deployments can alleviate these challenges.

In case of disaster, both Pilot Light and Warm Standby strategies offer the capability to limit data loss (RPO). Both offer sufficient RTO performance that enables you to limit downtime. Between these two strategies, you have a choice of optimizing for RTO or for cost.

Pattern 5 (P5): Multi-Region active/active

P5, multi-Region active/active disaster recovery, involves running multiple instances of an application simultaneously across different geographically dispersed sites or Regions. In this setup, all sites actively serve live traffic and handle user requests, providing continuous availability and workload distribution. The multi-Region active/active strategy gives you the lowest RTO (recovery time objective) and RPO (recovery point objective). However, this must be weighed against the potential cost and complexity of operating active stacks in multiple sites.

How does P5 work?

Multi-region active/active disaster recovery involves running identical instances of an application across multiple geographically dispersed sites. Traffic is distributed among these sites through load balancing, and data synchronization ensures consistency. In case of a site failure, failover mechanisms redirect traffic to the available sites, ensuring uninterrupted service, while regular testing validates the effectiveness of the setup.
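
The traffic side of this behavior can be modeled with a short sketch. The Region weights are hypothetical, and the function is a simplified stand-in for whatever load-balancing mechanism is actually used: it spreads traffic across healthy Regions in proportion to their weights, so a failed Region's share moves to the survivors.

```python
def traffic_shares(region_weights, healthy_regions):
    """Return each healthy Region's fraction of traffic, proportional to its weight."""
    active = {
        region: weight
        for region, weight in region_weights.items()
        if region in healthy_regions and weight > 0
    }
    total = sum(active.values())
    if total == 0:
        raise RuntimeError("no healthy Region available to serve traffic")
    return {region: weight / total for region, weight in active.items()}


# Hypothetical weights: both Regions actively serve half of the traffic.
weights = {"us-east-1": 50, "eu-west-1": 50}

print(traffic_shares(weights, healthy_regions={"us-east-1", "eu-west-1"}))
# {'us-east-1': 0.5, 'eu-west-1': 0.5}
print(traffic_shares(weights, healthy_regions={"eu-west-1"}))
# {'eu-west-1': 1.0}
```

The second call shows why each Region in an active/active design needs enough headroom to absorb the other's traffic, which is part of the cost this pattern carries.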

Multi-Region active/active pattern (P5). **Source:** AWS

Who is P5 suitable for?

Applications that have zero tolerance for disruption.

P5 Trade-offs

P5 addresses the disruption of a regional service by investing additional cost and complexity to achieve a near-zero recovery time objective (RTO). Multi-active deployments, which involve multiple collaborating applications, are generally complex and require asynchronous data replication across Regions, which affects data consistency. Operating this pattern requires a high level of process maturity, and it is advisable for customers to progress towards it gradually by first adopting the deployment patterns described earlier.

How can Stream.Security help with architecting your cloud resilience?

Stream.Security Architectural Standards can help you understand how you have implemented these resilience principles.
For example, Stream.Security can alert on the following:

Ensure CloudTrail trails have multi-region enabled

Ensuring that Amazon CloudTrail trails are enabled for all the supported AWS cloud regions increases the visibility of the API activity in your AWS account for security and management purposes. Applying CloudTrail trails to all AWS regions has multiple advantages, such as receiving log files from all regions in a single S3 bucket and a single CloudWatch Logs log group, managing trail configuration for all AWS regions from one location, and recording API calls in regions that are not used often in order to detect unusual activity.

Ensure Database Migration Service (DMS) replication instances are using Multi-AZ deployment configurations

Ensuring Database Migration Service (DMS) replication instances are using Multi-AZ deployment configurations provides High Availability (HA) through automatic failover to standby replicas in the event of a failure such as an Availability Zone (AZ) outage, an internal hardware or network outage, a software failure or in case of a planned maintenance session.

Ensure Elastic Load Balancer (ELB) cross-zone load balancing feature is enabled

Enabling Cross-Zone Load Balancing simplifies the deployment and management of applications that operate across multiple subnets located in different Availability Zones (AZs), while also ensuring improved fault tolerance and consistent traffic flow. With this feature enabled, the load balancer acts as a traffic guard in the event of an AZ failure due to a network outage or power loss. It stops requests from being routed to the unhealthy zone and instead redirects them to the other available zone(s).

EC2 Instance in one Availability Zone uses a NAT Gateway in a different Availability Zone

NAT gateways in each Availability Zone are implemented with redundancy. However, if you have resources in multiple Availability Zones and they share one NAT gateway, resources in the other Availability Zones lose internet access when that NAT gateway's Availability Zone goes down. To create an Availability Zone-independent architecture, create a NAT gateway in each Availability Zone and configure your routing to ensure that resources use the NAT gateway in the same Availability Zone. The NAT gateway enables outgoing internet connectivity for a private subnet. To achieve high availability, you need to create a NAT gateway in every Availability Zone in which you have created private subnets.
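
To make the check concrete, here is a minimal sketch of how cross-AZ NAT gateway usage could be detected. It is not Stream.Security's implementation; the input shapes (a list of subnet records and a mapping of NAT gateway ID to AZ) and all IDs are assumptions made for illustration.

```python
def find_cross_az_nat_usage(subnets, nat_gateway_azs):
    """Return (subnet_id, subnet_az, nat_az) for subnets routed through a NAT gateway in another AZ.

    subnets: list of dicts with keys 'subnet_id', 'az', 'nat_gateway_id' (hypothetical shape)
    nat_gateway_azs: dict mapping NAT gateway ID to the AZ it is deployed in
    """
    findings = []
    for subnet in subnets:
        nat_az = nat_gateway_azs.get(subnet["nat_gateway_id"])
        # Unknown gateways are skipped; only a confirmed AZ mismatch is reported.
        if nat_az is not None and nat_az != subnet["az"]:
            findings.append((subnet["subnet_id"], subnet["az"], nat_az))
    return findings


# Hypothetical inventory: both private subnets share the NAT gateway in us-east-1a.
nat_gateway_azs = {"nat-111": "us-east-1a"}
subnets = [
    {"subnet_id": "subnet-a", "az": "us-east-1a", "nat_gateway_id": "nat-111"},
    {"subnet_id": "subnet-b", "az": "us-east-1b", "nat_gateway_id": "nat-111"},
]

print(find_cross_az_nat_usage(subnets, nat_gateway_azs))
# [('subnet-b', 'us-east-1b', 'us-east-1a')]
```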

Conclusion

AWS offers five resilience patterns that provide organizations with a range of options to enhance the reliability and robustness of their cloud architectures. Each pattern comes with its own trade-offs, benefits, and suitability for specific use cases. By understanding these patterns and their characteristics, businesses can make informed decisions on which pattern aligns best with their requirements.

About Stream Security

Stream Security is an AI Detection & Response (AI DR) company built for the era of AI-driven environments across cloud, on-prem, and SaaS. As AI agents operate with real permissions and attackers move at machine speed, Stream enables security teams to keep pace by continuously computing a real-time, deterministic model of their entire environment. Powered by its CloudTwin® technology, Stream instantly understands the full impact of every action across identities, permissions, networks, and resources, allowing organizations to detect, prioritize, and safely respond to threats before they propagate. This transforms security from reactive detection into a true control plane for modern infrastructure.