Cloud Resiliency: Staying Off Downdetector

September 25, 2024

Written by KJ Stillabower, Executive Advisor

As enterprises embrace "Everything as a Service" (XaaS) models, many value the ability to offload inherently difficult aspects of resiliency to cloud providers. While cloud services can alleviate some of these responsibilities, they also introduce new complexities and dependencies that organizations must consider as part of their resiliency strategies. Over the past decade, many organizations have learned the hard way that the shared responsibility model includes new, often non-obvious requirements for design, planning, configuration, and maintenance to address these new risks. Faced with high-profile evidence to the contrary, regulators are increasingly unwilling to accept the notion that the cloud is infallible and can be engineered to be perfectly reliable. Understanding how the cloud reshapes resiliency planning is crucial for maintaining business continuity, ensuring customer satisfaction, and meeting regulatory and compliance requirements.

At A Glance:

Cloud services do not provide automated resiliency-as-a-service. Customers must understand new, often obscured complexities and manage them accordingly.
Unique risks in cloud environments make robust data protection mechanisms – including independent and immutable backups – critical for preventing data loss.
Detecting and managing service dependencies is crucial to prevent cascading failures in abstracted cloud services.
A seemingly diverse set of providers may rely on the same underlying infrastructure, posing hidden single points of failure or counterfeit diversity.
Public cloud providers can make execution more efficient, but executing safely in the cloud still demands detailed engineering and planning.

When cloud computing gained traction, one of its most enticing promises was the ability to delegate infrastructure management complexities to cloud providers. Enterprise IT teams expected that by shifting to cloud services, they could rely on utility-scale providers to handle intricate aspects of managing core infrastructure more efficiently, simplifying their operations. It was often assumed that public cloud services would be as reliable as domestic power grids. Failures, when they happened, were expected to be exceptionally rare and so widespread that customers and partners would excuse any individual organization affected by them.

However, reality has proven to be more nuanced. While cloud providers do offer robust infrastructure, they are not immune to failures. Over the past five years, multiple major cloud outages have impacted significant brands across various industries. More frequently, numerous smaller failures with limited impacts have disrupted operations for diverse organizations and even large enterprises. Ultimately, the shared responsibility model of cloud computing means that while providers handle certain aspects of resiliency, customers have an increased need to ensure their business remains operational in an increasingly complex and interdependent environment.

Beyond Architectural Best Practices

It is safe to assume that readers who have come this far know how to read documentation and can apply classic principles of redundancy to service architecture. Every commercial cloud service today has excellent documentation resources. These are highly detailed, high-quality documents providing clear examples for implementing fundamental strategies like redundancy, failover mechanisms, and disruption containment. These best practices, provided by every IaaS and PaaS provider, should be understood and generally followed.

This perspective focuses on critical aspects that are not typically highlighted in standard cloud vendor documentation. These challenges and pitfalls are not as commercially attractive to discuss because they do not directly promote additional services or products offered by solution providers. By bringing these considerations to the forefront, this perspective’s objective is to provide a more comprehensive understanding of cloud resiliency that goes beyond the basics and addresses the nuanced realities of operating in complex cloud environments.

Understanding Resiliency Across Service Models

Responsibilities and tools to mitigate threats vary greatly across delivery models and specific platforms. Each model requires organizations to consider potential threats and failures to their service delivery and determine how best to manage them. Classic risk management functions – identifying and managing risk through reduction, transfer, avoidance, or acceptance – help mitigate risks to availability and resiliency. As part of an effective risk management program, each service should consider threats to resiliency and formally document strategies to mitigate these risks.

As one moves down the service stack from SaaS to IaaS, resiliency planning becomes more critical. Modern IT services have complicated the traditional primary-standby configuration, sprawling into complex scenarios that are difficult to enumerate. Where organizations previously had physically distinct hardware, systems, and service providers, today it can be challenging to determine where single points of failure exist.

While most enterprise service offerings provide general architecture and resilience planning documentation – which is an excellent starting point – undetected pitfalls lurking beneath clusters and availability zones need to be anticipated and planned for accordingly.

Data Protection

Data protection in the cloud shares many similarities with traditional IT backups. Excellent services exist to archive vast amounts of data in a cost-effective manner, ensuring it is accessible – provided the associated costs remain manageable. However, the cloud introduces new risks for data protection that are unique to this environment.

Platform tools for moving data may not be fully developed in all instances. Company data within the platform may not be easily exportable, replicated, or backed up. Additionally, there's increasing emphasis on independent and immutable backups, which establish a software lock to prevent modification as a defense against ransomware. Beyond cybersecurity concerns, storing backups within the same infrastructure tenant or tenant family exposes them to configuration issues and human factors that can result in data loss.
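The two properties described above, independence from the production tenant and immutability, can be checked mechanically against an inventory of backup targets. The following is a minimal sketch under stated assumptions: the `BackupTarget` record, the tenant names, and the `backup_findings` function are hypothetical and do not correspond to any provider's API. A real inventory would be populated from whatever the platform actually reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTarget:
    """Hypothetical inventory record for one backup copy."""
    name: str
    tenant: str       # tenant (or tenant family) that holds the copy
    immutable: bool   # True if a write-once lock prevents modification


def backup_findings(production_tenant, targets):
    """Return human-readable findings for a set of backup targets."""
    findings = []
    for target in targets:
        if target.tenant == production_tenant:
            findings.append(f"{target.name}: stored in the production tenant")
        if not target.immutable:
            findings.append(f"{target.name}: not immutable")
    # At least one copy should be both independent and immutable.
    if not any(t.tenant != production_tenant and t.immutable for t in targets):
        findings.append("no independent, immutable copy exists")
    return findings


# Illustrative data only; names are assumptions.
targets = [
    BackupTarget("nightly-snapshot", tenant="prod", immutable=False),
    BackupTarget("offsite-vault", tenant="backup-org", immutable=True),
]
for finding in backup_findings("prod", targets):
    print(finding)
```

A check like this only reports what the inventory claims. It does not prove that a restore works, which still requires periodic recovery testing.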

Organizations must consider appropriate ways to protect both data and the configurations of the services they are deploying. While providers may be ready to deploy new servers, without the data and software, there may be nothing to run.

Service Dependency Ordering

In cloud environments, understanding and managing service dependencies is crucial for ensuring resiliency. A service may appear stable and reliable under normal operations, but unexpected failures in underlying dependencies can cause significant disruptions. IT professionals often share stories where overlooked dependencies led to cascading failures, particularly during system restarts or disaster recovery scenarios.

A common issue is the formation of circular dependencies or "logical loops" in service bootstrapping. This occurs when a service depends on another to start, but that secondary service, in turn, depends on the first. For example, consider a virtualization cluster hosting all compute services, where the shared storage system requires authentication from a Kerberos server. In a power outage, the Kerberos server cannot start because it relies on the storage system, which itself cannot become available without the authentication service – creating a deadlock that requires manual intervention.
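A logical loop like the one above can be caught before an outage by recording startup dependencies and attempting to order them. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The service names and the `STARTUP_DEPS` map are illustrative assumptions that mirror the Kerberos and storage example, not a real inventory.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical startup dependencies: each service maps to the set of
# services that must already be running before it can start.
STARTUP_DEPS = {
    "kerberos": {"shared-storage"},   # Kerberos relies on the storage system
    "shared-storage": {"kerberos"},   # storage requires Kerberos authentication
    "app-servers": {"shared-storage"},
}


def startup_order(deps):
    """Return a valid start order, or raise ValueError naming the loop."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # The second element of exc.args lists the nodes that form the cycle.
        loop = " -> ".join(exc.args[1])
        raise ValueError(f"circular startup dependency: {loop}") from exc


try:
    print(startup_order(STARTUP_DEPS))
except ValueError as err:
    print(err)
```

Running the same check in a deployment pipeline turns the deadlock from a discovery made during a power outage into a failed build. The harder part in practice is keeping the dependency map accurate.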

In the cloud, these dependency chains can be longer and more obscure due to abstraction layers and the distributed nature of cloud services. Dependencies might include identity management systems, networking configurations, storage solutions, and third-party APIs. Failure in any of these can propagate upward, affecting the availability and performance of dependent services.
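To illustrate how a failure propagates upward, the following sketch walks a dependency map in reverse to list every service affected when one component fails. The `DEPENDS_ON` map and the service names are hypothetical. In practice, such a map would have to be assembled from architecture documentation, configuration, and observed traffic.

```python
from collections import deque

# Hypothetical map: service -> the components it depends on directly.
DEPENDS_ON = {
    "checkout-api": {"identity", "orders-db"},
    "orders-db": {"block-storage"},
    "identity": {"dns"},
    "reporting": {"orders-db"},
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: component -> services that depend on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk up the dependency chain.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen - {failed}


print(sorted(impacted_by("block-storage", DEPENDS_ON)))
```

In this illustrative map, a block storage failure reaches three services even though only one of them depends on storage directly.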

Counterfeit Diversity

The benefits of hyperscale cloud providers and the diversity of services they offer can obscure where your services are provisioned and the potential impact of a particular service failure. Modern service delivery often involves partnerships, white-labeling, and service masking. As a result, a seemingly diverse portfolio of services may actually rely on the same underlying infrastructure – essentially presenting the same service under different branding. We like to refer to this as counterfeit diversity.

Counterfeit diversity poses a significant threat to cloud resiliency by masking true dependencies behind a façade of varied services. By recognizing and addressing this issue, organizations can avoid hidden single points of failure by moving beyond surface-level diversification and investing in strategies that ensure true independence where necessary. This involves active supplier management, transparent communication with vendors, and strategic planning to build a resilient cloud architecture. Unchecked, this hidden concentration risk can lead to unexpected outages and business disruptions when an underlying provider experiences issues.
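Once suppliers have disclosed where their services actually run, hidden concentration can be surfaced with a simple grouping. The sketch below is illustrative only: the vendor names, the provider labels, and the `VENDOR_HOSTING` inventory are assumptions. Obtaining accurate hosting information through supplier management is the real work.

```python
from collections import defaultdict

# Hypothetical supplier inventory: vendor -> underlying infrastructure
# provider, as disclosed by the vendor or established during due diligence.
VENDOR_HOSTING = {
    "primary-crm": "cloud-a",
    "backup-crm": "cloud-a",     # marketed as an alternative, same foundation
    "payments": "cloud-b",
    "status-page": "cloud-a",
}


def shared_infrastructure(vendor_hosting):
    """Return providers that underpin more than one vendor."""
    by_provider = defaultdict(list)
    for vendor, provider in vendor_hosting.items():
        by_provider[provider].append(vendor)
    return {
        provider: sorted(vendors)
        for provider, vendors in by_provider.items()
        if len(vendors) > 1
    }


for provider, vendors in shared_infrastructure(VENDOR_HOSTING).items():
    print(f"{provider} underpins: {', '.join(vendors)}")
```

In this example the primary and backup CRM share a provider, so the backup offers no protection against an outage at that provider.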

Purposeful Relationships

In a cloud strategy, organizations often grapple with choosing between cloud interoperability and specialization. While making workloads portable across multiple cloud providers may seem advantageous, the costs and complexities often outweigh the benefits, especially given how rarely services are migrated between providers. Efforts at both the enterprise and service levels should therefore focus on selecting the best-fit provider and optimizing for it.

Stretching services across multiple platforms can be counterproductive to resiliency goals and introduce significant expenses and operational challenges. By specializing with a limited number of providers, organizations can leverage unique services more effectively and streamline operations, enhancing overall efficiency and reliability.

However, executing a resiliency strategy within a provider also necessitates a well-thought-out exit plan. Remember, a service provider relationship is a business arrangement, not a lifelong commitment. Many providers offer proprietary services lacking direct equivalents elsewhere, making transitions complex and costly. To maintain effective resiliency, it is crucial to understand the options and prepare for potential changes. Developing a cloud exit strategy is a complex topic deserving further discussion in future perspectives. Stay tuned!

My Cloud Isn’t Resilient Enough

Achieving true enterprise resilience requires more than simply offloading responsibilities to providers or following standard best practices. Organizations want to avoid being the next story on the news due to an IT incident. Accordingly, IT teams must delve deeper into the nuanced differences presented by cloud environments, proactively addressing unique and critical areas such as immutable data protection, service dependencies, counterfeit diversity, and provider optimization. By doing so, businesses can strengthen their resilience against disruptions, ensuring continuity, customer satisfaction, and compliance in an ever-evolving digital landscape.

Key Takeaways

Leverage Documentation: Utilize vendor documentation to better understand and employ platforms as they are intended.

Data Protection: Implementing robust data protection mechanisms should be the most important resiliency strategy for any service. Data without backups can be permanently lost.

Service Level Agreements (SLAs): Clearly defined SLAs are essential for risk management in SaaS, obligating providers to take defined actions during failures.

Infrastructure-as-Code (IaC): Manage deployments effectively using IaC to ensure consistent and scalable recovery strategies in complex environments.

Understand Dependencies: Be aware of dependencies to prevent cascading failures in the abstracted layers of cloud services.

Avoid False Diversity: Recognize that different service providers might rely on the same underlying infrastructure, posing shared points of failure.

Plan an Exit: Have a clear exit plan to prepare for potential changes, maintaining resiliency by understanding migration options and steps.

At Windval, we understand the complexities and challenges of navigating cloud resiliency, especially within large enterprise IT environments. Our experts are dedicated to helping you develop robust strategies tailored to your unique needs. We are here to partner with you, providing the guidance and solutions necessary to enhance your cyber resilience and confidently embrace the future of cloud computing. If you have concerns about the resiliency of your cloud environment, contact us to start a conversation about how we can help.

Matthew Camden