Nov 14, 2017

What they're trying to explain is that so-called AP systems do not have 100% availability either. The CAP theorem applies to very specific scenarios that are not necessarily common; it is a much narrower theorem than "CP vs AP" implies. For example, if your network partitions are usually small (i.e. a majority exists outside the partition) and you can avoid routing requests to the minority partition (e.g. because you peer with ISPs and can route somewhere else, or internally avoid routing to a rack that is down, etc.), then you won't observe a loss of A.
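To make the "majority outside the partition" condition concrete, here is a minimal sketch, assuming a plain majority-quorum rule. The function names are made up for illustration and do not correspond to any particular database's API.

```python
from typing import List, Optional


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a partition holds a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def serving_side(partition_sizes: List[int]) -> Optional[int]:
    """Return the index of the partition side that can keep serving, or None.

    partition_sizes lists how many nodes ended up on each side of the split.
    Assumption: only a side with a strict majority may accept requests.
    """
    cluster_size = sum(partition_sizes)
    for index, size in enumerate(partition_sizes):
        if has_quorum(size, cluster_size):
            return index
    return None


# A 5-node cluster that loses one rack: the 4-node side still has a majority,
# so clients routed there see no loss of availability.
print(serving_side([4, 1]))

# An even split leaves no majority anywhere, so no side can serve.
print(serving_side([2, 2]))
```

If requests can be steered to the side returned by `serving_side`, clients never notice the partition; only requests that land on the minority side are refused.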

For more information on this perspective you can read _Spanner, TrueTime and the CAP theorem_ by the fella that coined the term CAP: https://static.googleusercontent.com/media/research.google.c...

Yammer concurred with this perspective, stating:

"At Yammer we have experience with AP systems, and we’ve seen loss of availability for both Cassandra and Riak for various reasons. Our AP systems have not been more reliable than our CP systems, yet they have been more difficult to work with and reason about in the presence of inconsistencies. Other companies have also seen outages with AP systems in production. So in practice, AP systems are just as susceptible as CP systems to outages due to issues such as human error and buggy code, both on the client side and the server side."

https://yokota.blog/2017/02/17/dont-settle-for-eventual-cons...

Oct 10, 2017

Leveraging accurate clocks doesn't let Google ignore partitions. "TrueTime itself could be hindered by a partition"[0]. Spanner also uses two-phase commits and locking, which are unavailable under certain kinds of network partitions.

From their 2017 paper on Spanner and CAP:

> To the extent there is anything special, it is really Google’s wide-area network, plus many years of operational improvements, that greatly limit partitions in practice, and thus enable high availability.

[0] https://static.googleusercontent.com/media/research.google.c...

Jun 27, 2017

It's kind of amazing how we have to have this discussion again every time somebody designs a CP system with excellent availability.

I'll just come out and say it: the 'A' in CAP is boring. It does not mean what you think it means. Lynch et al. probably chose the definition because it's one for which the 'theorem' is both true and easy to prove. This is not the impossibility result with which designers of distributed systems should be most concerned.

My heuristic these days is that worrying about the CAP theorem is a weak negative signal. (EDIT: This is not a statement about CockroachDB's post, which doubtless is designed to reassure customers who are misinformed on the topic. I'm familiar with that situation, and it makes me feel a deep sympathy for them.)

(Disclosure: I work on a CockroachDB competitor. Also none of this is Google's official position, etc., etc. For that, here's the whitepaper by Eric Brewer that we released along with the Cloud Spanner beta launch https://static.googleusercontent.com/media/research.google.c...).

May 14, 2017

The bigger issue is that you need Google's incredible inter-DC networking, which in practice makes partitions very rare. Eric Brewer (author of the CAP theorem) lays out here [0] how Spanner relies on those networking guarantees to be effectively CA.

Google's inter-DC traffic flows entirely on private links rather than on the public internet, which is very hard for any other company to match on a global scale.

[0] https://static.googleusercontent.com/media/research.google.c...

Apr 06, 2017

Calvin is still a CP system, so nodes outside of the quorum cannot proceed. The point, however, is that partitions are rare enough that a CP system can still provide a high level of availability, despite the theoretical limitation. Eric Brewer, who came up with the CAP theorem, explicitly makes this point here: https://static.googleusercontent.com/media/research.google.c...
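As a rough back-of-the-envelope sketch of why rare partitions still leave a CP system highly available: the downtime figure below is purely hypothetical, not a measurement from Calvin, Spanner, or any other system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(unavailable_minutes_per_year: float) -> float:
    """Fraction of the year in which a quorum is reachable and requests succeed."""
    return 1.0 - unavailable_minutes_per_year / MINUTES_PER_YEAR


# Hypothetical: partitions cost the quorum about 5.256 minutes per year in total.
# That works out to roughly "five nines", even though the system gives up
# availability entirely whenever such a partition is in progress.
print(f"{availability(5.256):.5%}")
```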