Nov 28, 2020

root-cause tldr:

...[adding] new capacity [to the front-end fleet] had caused all of the servers in the [front-end] fleet to exceed the maximum number of threads allowed by an operating system configuration [number of threads spawned is directly proportional to number of servers in the fleet]. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.


...moving to larger CPU and memory servers [and thus fewer front-end servers]. Having fewer servers means that each server maintains fewer threads.

...making a number of changes to radically improve the cold-start time for the front-end fleet.

...moving the front-end server [shard-map] cache [that takes a long time to build, up to an hour sometimes?] to a dedicated fleet.

...move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet.

...accelerate the cellularization [0] of the front-end fleet to match what we’ve done with the back-end.

[0] and

Jul 20, 2020

This is why AWS is puts effort into blast radius reduction.

The Physalia paper is a wonderful read that explains both the justification and high-level approach to implementing said isolation [1].

Another great example is shuffle sharding [2], notably used in Route53.

The end goal of this is that aws outages would be isolated to subsets of aws customers where possible. So instead of half the internet failing it's only 10% (obviously the exact numbers depend heavily on implementation).

[1] [2]

Mar 05, 2020

OT: I don't know if this level of rigor is employed by any major libraries. I know AWS has used it a great deal for their systems though notably recently[1]