As customers migrate to the cloud, many struggle to adapt business continuity and operational plans from their on-premises environments. This affects the resilience of critical business applications and can stall cloud adoption. This two-part blog series will provide guidance on implementing IT resilience strategies. In Part I, we’ll review challenges commonly experienced by executive builders. We’ll also explore the definition of resilience in the cloud and key considerations for adapting mindsets and organizational culture. In Part II, we’ll explore the technical considerations related to architecture and patterns.

Customer challenges

In a discussion about a data center exit strategy, a senior IT executive at a global financial services organization said, “I feel we’re more mature today to consider a business case for a data center exit […] to AWS. But my main concern is how I can ensure resilience in this hybrid environment while still meeting our regulatory compliance requirements.” After we asked follow-up questions, it became clear that their concern was how to implement a two-site data center disaster recovery (DR) model in the cloud.

In the past 12 months, we’ve increasingly observed that business leaders identify, directly or indirectly, resilience as a primary area of concern. Often, this concern comes up in discussions about business continuity due to the COVID-19 pandemic. It’s also sometimes mixed with concerns about highly publicized security or outage events. Other times, it’s expressed during discussions around cloud technology due diligence. For the executive builders in these organizations, it has taken years to get to an IT “compliance equilibrium” with regulations. Pursuing digital transformation efforts while simultaneously migrating legacy workloads to the cloud is operationally challenging. This challenge can disrupt the compliance equilibrium and stall cloud adoption.

Considerations to get started

Cloud adoption is often central to any large-scale digital transformation in a customer’s business. But transformations are disruptive, and disruption is unnerving. Building a resilience strategy and resilient infrastructure can give your organization peace of mind. But resilient systems need resilient organizations; the two go hand-in-hand. The following considerations will help executive builders get started.

1. Understanding resilience in the cloud

How should the executive builder think about resilience in the cloud? Resilience is a measure of how an infrastructure, workload, or platform can protect itself against disruption caused by adverse events and conditions. Like other architecture attributes, resilience is measured on a scale (that is, a degree to which a system is resilient). It is not measured as a binary feature (in other words, resilient vs. non-resilient).

Furthermore, resilience is an overarching attribute that is tied to other architecture attributes such as availability, security, and performance. Due to its business affinity (that is, business continuity), non-technology leaders often use the term “resilience” more broadly to mean any number of related architecture attributes. But for executive builders, we recommend centering your resilience strategies around availability, performance, and disaster recovery.
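One common way to place a workload on that scale is an availability target expressed as a percentage. The following is a minimal sketch, not taken from the original discussion, that converts a target into the yearly downtime it allows. The function name and the sample targets are illustrative assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Hypothetical targets, chosen only to show that resilience is a scale.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes/year")
```

Each additional “nine” shrinks the downtime budget by a factor of ten, which is why the target for each asset is a business decision and not a yes-or-no question.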

2. Practice and automate resilience strategies

How do executive builders ensure investment in resilience will pay off? Our answer: continuously subject your systems to conditions that build an organizational “muscle” that will support these systems during normal and abnormal times.

In the following subsections, we share effective practices we have observed from long-time cloud adopters to help you go beyond designing for resilience and employ an increasing degree of complexity and automation.

Architecture reviews 

The AWS Well-Architected Framework guides leaders through building and maintaining resilient infrastructures, applications, and data. At a minimum, we suggest incorporating AWS Well-Architected Reviews frequently in your lifecycle management and using the AWS Well-Architected Tool to sustain and improve resilience over time. We also suggest using the various AWS Well-Architected Lenses to consider critical workloads and technology domains such as analytics stacks or high performance computing (HPC) clusters. This practice will push resilience questions to the top of each discussion, such as “what happens when this fails?” where “this” is any critical component of your environment.

Table-top incident simulations 

Just like routine fire drills, executive builders must periodically test their operational plan to respond to an incident.

Your operational recovery scenarios related to workloads, infrastructure, or data should be tested at a cadence that matches your organization’s development lifecycle. We suggest starting with quarterly tests and working towards testing only at major lifecycle milestones.

For full disaster recovery scenarios, we suggest starting with an annual review because it may be required by compliance regulations. From there, we suggest performing a quarterly review to identify ways to strengthen your resilience.

Chaos engineering

Over time, people who have adopted cloud architecture can invest in automating many of the anticipated events and incidents that would challenge their system’s resilience. Principles of chaos engineering can be adopted to build these capabilities within your environment. For example, AWS Fault Injection Simulator can be deployed to make it easier for teams to discover weaknesses in their environments at scale. This practice will help your team adopt an “everything fails eventually” mindset, which will help them prioritize resilient design patterns.
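To make the idea concrete, here is a minimal sketch of a chaos experiment in plain Python. It does not use AWS Fault Injection Simulator or any AWS API. The dependency, the failure rate, and every function name are hypothetical and chosen only for illustration. The experiment injects failures into a dependency and measures how often requests still succeed with and without a simple retry.

```python
import random


class InjectedFault(Exception):
    """Raised by the experiment to simulate a failing dependency."""


def flaky(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail with InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times; return None if every attempt fails."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault:
            continue
    return None


def run_experiment(attempts, failure_rate=0.3, requests=1000, seed=42):
    """Return the fraction of requests that succeed under injected faults."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    lookup_price = flaky(lambda: 100, failure_rate, rng)  # hypothetical dependency
    successes = sum(
        call_with_retries(lookup_price, attempts) is not None
        for _ in range(requests)
    )
    return successes / requests


print("one attempt:   ", run_experiment(attempts=1))
print("three attempts:", run_experiment(attempts=3))
```

Comparing the two success rates gives the team a measurable signal of whether a resilience pattern (here, a retry) actually absorbs the failure, which is the kind of evidence an “everything fails eventually” mindset relies on.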

3. Think big. Start small

Transitioning traditional IT infrastructure models to the cloud and then building in resilient processes is difficult, especially if you aim to do it all at once. However, it can be done. We have seen the most success when executive builders start with a manageable scope, iterate, then scale, as follows:

  • First, classify your technology assets according to business criticality. A technology asset may be a single application or a vital system like a customer relationship management solution that applications depend on. We see many customers use terms like “Tier 0,” “Red,” or “Mission Critical” to describe their critical assets.
  • Next, implement a resilience plan for a single critical asset or a small set of related assets.
    • You’ll need a cross-functional team to agree on the availability and performance requirements and help translate these requirements into a work backlog.
    • The team should analyze the technology asset for weaknesses using the principles of chaos engineering.
    • Business stakeholders should help capture business metrics like a reduction in unprocessed orders or an improvement in customer satisfaction.
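The first two steps above can be sketched as a small inventory exercise. This is an illustrative sketch only: the asset names, the numeric tier scheme, and the `first_scope` helper are assumptions, not a prescribed model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    tier: int  # 0 = most critical ("Tier 0"); larger numbers are less critical
    depends_on: tuple = ()


def first_scope(assets, pilot_name):
    """Return the pilot asset plus the critical assets it directly depends on."""
    by_name = {a.name: a for a in assets}
    pilot = by_name[pilot_name]
    related = [by_name[d] for d in pilot.depends_on if by_name[d].tier == 0]
    return [pilot] + related


# Hypothetical inventory, classified by business criticality.
inventory = [
    Asset("order-service", tier=0, depends_on=("crm", "reporting")),
    Asset("crm", tier=0),
    Asset("reporting", tier=2),
]

scope = first_scope(inventory, "order-service")
print([asset.name for asset in scope])
```

Keeping the first scope to one critical asset and its critical dependencies gives the cross-functional team a backlog small enough to finish and measure before scaling out.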

The team that implements resilience in your first asset (or set of related assets) will form the core of a new resilience center. This team will likely be eager to share their knowledge and best practices across the organization. We suggest giving them a platform, such as a quarterly resilience review, to celebrate their success and encourage other teams to follow their example.

Conclusion

Executive builders are responsible for assuring business leaders that their IT assets are resilient and also leading their teams to achieve resilience. In this blog, we provided guidance to help these leaders align with how business stakeholders express resilience concerns, and to lead builders to approach resilience design differently in the cloud. In a follow-up article, we’ll dive into more technical resilience considerations.

Figure 1. Three considerations for building resilience strategy and practices