What do you do when you find a snake in your datacenter?


Fault tolerance is the property that lets a system keep operating properly when some of its components fail. You might think Facebook solved all of its fault tolerance problems long ago, but when a serpent enters the Edenic datacenter realm, even Facebook must consult the Tree of Knowledge.

In this case, it’s not good or evil we’ll learn about, but Workload Placement, a method of optimally placing work across a set of failure domains.

Here’s my gloss of the talk “Fault Tolerance through Optimal Workload Placement”:

