Feb 14, 2017

Some interesting stuff in https://cloud.google.com/spanner/docs/whitepapers/SpannerAnd... about the social aspects of high availability.

1. Defining high availability in terms of how a system is used: "In turn, the real litmus test is whether or not users (that want their own service to be highly available) write the code to handle outage exceptions: if they haven’t written that code, then they are assuming high availability. Based on a large number of internal users of Spanner, we know that they assume Spanner is highly available."

2. Ensuring that people don't become too dependent on high availability: "Starting in 2009, due to “excess” availability, Chubby’s Site Reliability Engineers (SREs) started forcing periodic outages to ensure we continue to understand dependencies and the impact of Chubby failures."

I think point 2 is really interesting. Netflix has Chaos Monkey to help address this (https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey). There's also a book called Foolproof (https://www.theguardian.com/books/2015/oct/12/foolproof-greg...), which talks about how perceived safety can lead to bigger disasters in lots of different areas: finance, driving, natural disasters, etc.
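
To make points 1 and 2 concrete, here is a rough Python sketch. Everything in it is made up for illustration (`OutageError`, `FlakyLockService`, `read_with_fallback`); it is not Chubby's or Spanner's API, just a toy dependency that can be told to fail on purpose, plus a caller that actually handles the failure instead of assuming high availability.

```python
import random


class OutageError(Exception):
    """Raised when the (hypothetical) dependency is unavailable."""


class FlakyLockService:
    """Toy key/value dependency that injects forced outages at a given rate.

    Hypothetical stand-in for a highly available service; not a real API.
    """

    def __init__(self, data, outage_rate, rng=None):
        self._data = dict(data)
        self._outage_rate = outage_rate  # probability in [0.0, 1.0]
        self._rng = rng or random.Random()

    def get(self, key):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never fails.
        if self._rng.random() < self._outage_rate:
            raise OutageError(f"forced outage while reading {key!r}")
        return self._data[key]


def read_with_fallback(service, key, default):
    """A caller that does not assume high availability."""
    try:
        return service.get(key)
    except OutageError:
        # The "litmus test" code path: only exercised during an outage.
        return default
```

With `outage_rate=0.0` the fallback branch never runs, which is exactly how untested assumptions creep in; turning the rate up now and then is the forced-outage idea in miniature.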

Feb 14, 2017

This is thoroughly wrong. Cloud Spanner sacrifices the "A", not the "P". The cool thing being accomplished here is that the sacrifice to the A is greatly reduced: availability is still five or more 9s. There are several documents on the subject linked right off that page and elsewhere in these same comments, like this one: https://cloud.google.com/spanner/docs/whitepapers/SpannerAnd...
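
For a sense of scale, a quick back-of-the-envelope calculation of what a given number of 9s means in downtime. The function name is mine, and it assumes a 365-day year:

```python
def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per year for an availability fraction."""
    minutes_per_year = 365 * 24 * 60  # 525,600, ignoring leap years
    return (1.0 - availability) * minutes_per_year


for label, availability in [("three 9s", 0.999), ("five 9s", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(availability):.2f} min/year")
```

Five 9s works out to a bit over five minutes of unavailability per year, versus nearly nine hours for three 9s.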

Feb 14, 2017

Oh, and the newer white paper from today: https://cloud.google.com/spanner/docs/whitepapers/SpannerAnd...

Feb 14, 2017

Interesting reading: Spanner Whitepaper

https://cloud.google.com/spanner/docs/whitepapers/SpannerAnd...