Designing for Failure in the Cloud
In the world of cloud architecture, we are often advised to design for failure. This includes:
- Utilising multiple availability zones
- Implementing premium storage solutions with improved SLAs
- Establishing a secondary region, or even a third for redundancy
- Scaling out rather than up
While these strategies genuinely move the dial on availability, they can also lead to substantial cloud consumption costs.
The Dilemma of Availability Architecture
In many organisations, the architecture designed for availability stems from a mindset focused on worst-case scenarios. Outages can be catastrophic, and the fear of being held responsible for downtime drives teams to duplicate services, invest in excess capacity, and incorporate failover options “just in case.” Unfortunately, these decisions often go unexamined, leading to a gradual escalation of costs over time.
It is common, for instance, to encounter secondary regions that remain fully provisioned but never used, zone redundancy applied to non-critical services, and premium services running in environments with negligible traffic, just in case.
🧭The Heart of the Matter
The core issue is not resilience itself; it is that we have stopped reviewing the cost and benefit of our architectures against actual business requirements.
🔄Common Over-Engineering Patterns
Teams typically do not set out to overspend on availability. However, designs often emerge from good intentions aimed at protecting against failure, reducing risk, and improving uptime. Without clear architectural guidance, these strategies can expand into unnecessary costs. Here are some examples:
- Always-On Capacity in Multiple Regions. Establishing a secondary region is a prudent high-availability strategy, but keeping both regions active doubles the cost. Often, the secondary region remains unused for prolonged periods. If failover is infrequent, consider an active-passive setup or a cold standby model to reduce expenses; a rough cost comparison is sketched after this list.
- Zone Redundancy by Default. Azure Storage and many PaaS services offer zone-redundant options that can be beneficial but are not always necessary. For instance, development and test environments are often made zone redundant without a compelling business justification, leading to unnecessary costs.
- Premium SKUs Selected Without Review. Premium tiers often come with stronger SLAs and extra features, yet these choices rarely undergo subsequent scrutiny. The original reason for choosing the Premium tier may have become redundant as the platform evolved; for example, Azure's Standard SSDs now provide many of the benefits that once required Premium SSDs, without compromising reliability. Migrating from Premium to Standard in that scenario can yield a considerable cost saving without reducing reliability, particularly if the uptime SLA was the deciding factor.
- Overprovisioning for Hypothetical Traffic. Planning for outages often leads to provisioning backup regions or failover paths at peak usage levels. However, during an actual incident, traffic may decline significantly. Matching backup capacity to production scale should only occur if explicitly mandated by the business.
- Identical Infrastructure Across Environments. While maintaining consistency across Development, UAT, and Production environments is essential, applying the same architecture—including regional failover—can lead to excessive costs in non-critical settings.
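To make the first of these patterns concrete, here is a minimal, illustrative Python sketch comparing the steady-state cost of an active-active deployment against warm and cold standby topologies. The instance count, unit cost, and standby footprint percentages are assumptions chosen purely for illustration, not real Azure pricing.

```python
# A rough, illustrative cost model comparing failover topologies.
# All figures are hypothetical placeholders, not real Azure prices.

MONTHLY_COST_PER_INSTANCE = 450.0   # assumed cost of one production-grade instance
PRIMARY_INSTANCES = 6               # assumed steady-state capacity in the primary region

def monthly_cost(topology: str) -> float:
    """Estimate monthly spend for a given failover topology."""
    primary = PRIMARY_INSTANCES * MONTHLY_COST_PER_INSTANCE
    if topology == "active-active":
        # Secondary region mirrors production capacity at all times.
        return primary * 2
    if topology == "active-passive-warm":
        # Secondary runs a reduced footprint (say 25%) and scales out on failover.
        return primary + 0.25 * primary
    if topology == "cold-standby":
        # Only replication and a small "pilot light" footprint are billed continuously.
        return primary + 0.05 * primary
    raise ValueError(f"unknown topology: {topology}")

for t in ("active-active", "active-passive-warm", "cold-standby"):
    print(f"{t:>20}: ~£{monthly_cost(t):,.0f}/month")
```

Even with these made-up numbers, the point stands: the always-on secondary roughly doubles spend, while the standby options trade recovery time for a much smaller ongoing bill.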
These architectures are not necessarily wrong; however, when they are not reviewed against business requirements, they shift cloud spending away from business value creation.
💡Designing for Failure Without Excess Costs
Resilience is crucial; however, not every system necessitates maximum uptime. Many workloads can tolerate brief outages or degraded performance. Designing for failure involves being strategic about what you protect, how you recover, and what expenses you are willing to incur.
Here are practical strategies to help minimise availability costs while still guarding against failures:
- Utilise Active-Passive Failover Where Feasible. Not every workload demands full active-active deployments. For internal tools or low-traffic customer-facing applications, an active-passive model may suffice. Keep the passive region either warm or cold based on recovery time objectives and only scale when failover is activated.
- Select Lower Tiers for Standby Solutions. Your secondary region does not need to replicate production specifications. Use lower SKUs or standard tiers for backup paths. If failover events are rare, a slight, temporary degradation in user experience is usually acceptable and does not compromise service integrity.
- Focus on Rapid Recovery Rather Than Constant Uptime. If high availability costs exceed the business value of the system, prioritise quick recovery through automation and snapshots instead of duplicating infrastructure.
- Align Strategy With Service Criticality. Critical production systems may warrant premium uptime guarantees; non-critical tools do not. Tailor resilience to the potential impact of failure rather than applying a uniform standard across environments; the first sketch after this list shows one way to express such a tiering.
- Monitor Failover Costs Diligently. If a standby region exists, monitor it continuously, as these environments can quietly drift into production usage when extra resources or services are added; the second sketch below illustrates a simple budget guard for this.
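For teams that want to make the criticality tiering explicit, the following minimal Python sketch shows one way to encode criticality tiers as resilience profiles. The tier names, RTO targets, SKU labels, and zone-redundancy flags are illustrative assumptions; the real values should come from agreed business requirements.

```python
from dataclasses import dataclass

# Illustrative mapping from service criticality to a resilience profile.
# Tier names, RTO targets and SKU choices are assumptions for the sketch,
# not prescriptive values; they should be agreed with the business.

@dataclass
class ResilienceProfile:
    standby_mode: str      # "active-active", "warm", "cold", or "none"
    standby_sku: str       # tier used for the secondary/backup path
    rto_minutes: int       # recovery time objective
    zone_redundant: bool

PROFILES = {
    "mission-critical":  ResilienceProfile("active-active", "premium",  5,    True),
    "business-critical": ResilienceProfile("warm",          "standard", 60,   True),
    "supporting":        ResilienceProfile("cold",          "standard", 240,  False),
    "dev-test":          ResilienceProfile("none",          "basic",    1440, False),
}

def profile_for(criticality: str) -> ResilienceProfile:
    """Look up the agreed resilience profile for a workload's criticality tier."""
    return PROFILES[criticality]

print(profile_for("supporting"))
```

Having a table like this in code (or in a policy document) makes the trade-off visible: a dev-test workload simply should not inherit the mission-critical profile by default.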
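As a companion to the monitoring point above, here is a simple, illustrative guard that compares standby-region spend against an agreed budget. The budget figure, tolerance, and the `check_standby_spend` helper are hypothetical; the observed spend would come from whatever cost reporting or export process you already use.

```python
# Illustrative guard for standby-region spend drift.
# The budget and tolerance below are assumptions; observed spend would be
# fed in from your existing cost reporting or export pipeline.

STANDBY_BUDGET = 1_000.0   # agreed monthly budget for the passive region
TOLERANCE = 0.10           # allow 10% variance before raising a flag

def check_standby_spend(observed_monthly_spend: float) -> None:
    """Warn when the passive region's spend drifts above the agreed budget,
    which often means new resources have quietly crept into production use."""
    limit = STANDBY_BUDGET * (1 + TOLERANCE)
    if observed_monthly_spend > limit:
        print(f"ALERT: standby spend £{observed_monthly_spend:,.0f} exceeds "
              f"budget £{STANDBY_BUDGET:,.0f} (+{TOLERANCE:.0%}); review for drift")
    else:
        print("Standby spend within the agreed envelope.")

check_standby_spend(1_350.0)
```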
⚖️Conclusion: Finding the Balance
Resilience is essential, but so is cost management. Many teams approach availability without evaluating its true business value, which leads to duplicated infrastructure, overprovisioned capacity, and premium uptime guarantees that are paid for even when the risks are low.
FinOps principles are not about eliminating the guardrails; they aim to highlight the trade-offs involved and provide the framework needed to view design through a cost lens. Cost-conscious design discussions allow organisations to make intentional resilience choices aligned with business value.