Today, I’m going to tell you about a new site we launched at re:Invent, the Amazon Builders’ Library, a collection of living articles covering topics across architecture, software delivery, and operations. You get to peek under the hood of how Amazon architects, releases, and operates the software underpinning Amazon.com and AWS.
Want to know how Amazon.com does what it does? This is for you. In this two-part series (the next one coming December 23), I’ll highlight some of the best architecture articles written by Amazon’s senior technical leaders and engineers.
Avoiding insurmountable queue backlogs
In queueing theory, the behavior of queues when they are short is relatively uninteresting. After all, when a queue is short, everyone is happy. It’s only when the queue is backlogged, when the line to an event goes out the door and around the corner, that people start thinking about throughput and prioritization.
In this article, I discuss strategies we use at Amazon to deal with queue backlog scenarios – design approaches we take to drain queues quickly and to prioritize workloads. Most importantly, I describe how to prevent queue backlogs from building up in the first place. In the first half, I describe scenarios that lead to backlogs, and in the second half, I describe many approaches used at Amazon to avoid backlogs or deal with them gracefully.
Timeouts, retries, and backoff with jitter
Whenever one service or system calls another, failures can happen. These failures can come from a variety of factors. They include servers, networks, load balancers, software, operating systems, or even mistakes from system operators. We design our systems to reduce the probability of failure, but impossible to build systems that never fail. So in Amazon, we design our systems to tolerate and reduce the probability of failure, and avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff.
Challenges with distributed systems
The moment we added our second server, distributed systems became the way of life at Amazon. When I started at Amazon in 1999, we had so few servers that we could give some of them recognizable names like “fishy” or “online-01”. However, even in 1999, distributed computing was not easy. Then as now, challenges with distributed systems involved latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos. As the systems quickly grew larger and more distributed, what had been theoretical edge cases turned into regular occurrences.
Developing distributed utility computing services, such as reliable long-distance telephone networks, or Amazon Web Services (AWS) services, is hard. Distributed computing is also weirder and less intuitive than other forms of computing because of two interrelated problems. Independent failures and nondeterminism cause the most impactful issues in distributed systems. In addition to the typical computing failures most engineers are used to, failures in distributed systems can occur in many other ways. What’s worse, it’s impossible always to know whether something failed.
Static stability using Availability Zones
At Amazon, the services we build must meet extremely high availability targets. This means that we need to think carefully about the dependencies that our systems take. We design our systems to stay resilient even when those dependencies are impaired. In this article, we’ll define a pattern that we use called static stability to achieve this level of resilience. We’ll show you how we apply this concept to Availability Zones, a key infrastructure building block in AWS and therefore a bedrock dependency on which all of our services are built.
Check back in two weeks to read about some other architecture-based expert articles that let you in on how Amazon does what it does.
from AWS Architecture Blog: https://aws.amazon.com/blogs/architecture/reinvent-2019-introducing-the-amazon-builders-library-part-i/