In part 1 of this series of blog posts, we talked about how the choice between NoSQL and SQL databases is bound to the core design of the application and I promised to get deeper into what this means. We started by looking into how support for a flexible schema is both advantageous and challenging. In this post, I will discuss CAP theorem and explain how it affects both the choice of the database technology and the application logic. Understanding CAP theorem and its implications is very important in designing a distributed system.
As a software engineer who is dealing with databases, you’ve probably already heard of CAP theorem. However, chances are that you’ve been exposed to the one-liner definition – as I was when I first started working with NoSQL. As a matter of fact, it sounds pretty obvious at first, so not all of us feel the need to dig deeper. But, I assure you, the better you understand CAP theorem, the more you can utilize the concept to match your requirements, which goes a long way in avoiding design problems.
In a distributed system, partitions form when the replica servers can’t talk to each other due to a communication failure (although each side of the partition is still available to serve requests). Considering that partitions are unavoidable in a distributed system, and you can’t sacrifice partition tolerance, the CAP theorem states that in the event of a partition, you need to choose between either availability or consistency. Let’s first clarify these terms in the context of CAP.
Consistency means a read request must receive a single copy of the data from all servers in the distributed system, and should reflect the latest data.
A system is considered (highly) available if all read and write requests receive a timely non-failure response at all times. Note the word ‘timely’ in this definition; we will come back to that. For now, assume it means in real time.
Sacrificing Consistency or Availability:
Based upon the above definitions, CAP theorem indicates that in the event of a partition we have to choose between two cases:
- Prefer consistency over availability: the disconnected replicas should not accept new requests, until they can communicate, so that the latest status of data can spread to all replicas.
- Prefer availability over consistency: all servers should continue responding to incoming requests with the risk of not knowing the latest state of the data that is being gathered by the other side of partition.
This is the core of CAP theorem and it is the main decision to be made when choosing a database technology. SQL databases clearly follow ACID properties which indicates consistency is a big part of their design. NoSQL databases generally fall into two categories; most prefer availability (known as AP systems), and some prefer consistency (known as CP systems). There are also some databases that don’t fall into a clear category regarding CAP.
The careful paradigm shift
The idea of sacrificing consistency sounds alien to most of us. At the same time, we’re being taught over and over that the new world is the world of availability. As a result, there seems to be a paradigm shift in the world of application data handling. On the two extremes of this paradigm shift, I see two equally damaging schools of thoughts. One is the quick conclusion that using traditional less-available systems is the only way to achieve consistency. The other is the quick conclusion that availability is the number one concern of all modern applications, and that they should learn to live with inconsistency in the data.
I would challenge the first school of thought by clarifying that choosing availability doesn’t mean losing control of consistency. It just makes it more challenging to come up with a good design. One that leads to only rare cases of inconsistency or loss of data, and which need to be planned for. In addition, a system with little writes and heavy reads is less prone to inconsistency while it can benefit heavily from high availability. Also note that not all NoSQL databases favor availability over consistency, and sometimes they provide parameters to adjust system’s behavior. To challenge the second school of thought, I don’t think all applications have huge availability requirements, despite what their owners wish for. There are many situations where we can choose the cheaper solution of choosing built-in consistency for an acceptable level of availability for our application requirements.
Digital or Analogue?
When I was explaining CAP theorem, I started with the assumption that partitions happen when two replicas can’t communicate to each other due to network failure. In such situations, if a request waits for a network link to come online, it has a big draw on availability. That’s when consistency and availability clearly oppose each other.
Earlier in this post, we defined availability as receiving ‘real-time’ response. If the application requires real-time availability, then not only network failures, but also low latency can be held against consistency. A highly available system can’t wait for replications to complete while a request is waiting to receive the latest data. For such analysis, CAP theorem can be extended to include low latency as a form of partition forming.
Depending on the weight you put on availability and consistency requirements, these could be slightly adjusted against each other. Depending on the requirements, accepting some low latency could be beneficial considering it can significantly reduce chances of inconsistency in data. The system’s behavior in case of low latency, versus a case of network failure is what Professor Abadi describes in his PACELC theorem.
Although consistency may need to be overlooked while a request is being served, the system can recover to a consistent state later on, as the next section briefly explains. Note that when we talk about ‘recovery to a consistent state’, it means that all servers can eventually return the same result for a single request. It doesn’t guarantee that no data is lost in the process to reach this consistent state.
Strategy for eventual consistency
Professor Brewer, who originated the CAP theorem, has a famous article in which he explains how designers can handle recovery from partitions in terms of inconsistencies. Considering the recommendations in his article, you should consult the architecture and implementation of each distributed database to understand the built-in partition recovery or eventual consistency strategy. Or, to understand what information is available in order to implement your own recovery mechanism.
To better illustrate this, let’s consider CouchDB’s eventual consistency mechanism. CouchDB creates a new document with a new revision number for each write operation. It also constantly replicates the documents across nodes. When two parallel write requests are received at two different replicas, both writes are preserved by their corresponding revision numbers. When these documents are replicated, CouchDB needs to resolve the conflict by selecting one of these documents as the latest version, to be returned for next read operation. In some cases none of the existing document versions is a true ‘latest correct version’. So, the document that is chosen by CouchDB is not necessarily the one you would have logically expected. This can happen when the two update operations have altered separate sections of the document. Depending on the application logic, you may want to preserve both changes. In this situation, your application needs to implement its own recovery mechanism by retrieving multiple revisions of the document and merging the data, or selecting an older version as the latest.
One recommended practice in order to avoid many conflicts, especially the kind that needs data merging, is to design your documents in such a way that each contains as little data as possible by splitting objects into smaller coherent objects. This reduces the chance of updates on different fields within a document and hence reduces the chance of overwriting an update. However, this design choice should be made as trade-off with the possibility of needing more inter-document joins and also the transaction requirements as will be discussed in one of my upcoming blog posts.
To wrap up
In the landscape of distributed databases (mostly NoSQL) there are many variations of architectures. According to both CAP and PACELC theorems, database systems should prefer either availability or consistency in case of partitions or low latency. Making this choice is one of the main aspects of selecting the right database system for an application. It first requires a careful understanding of the application’s logical requirements in terms of consistency and data loss, as well as its expected SLA in terms of availability and amount of data altering traffic. Another thing to consider is the mechanism provided to the application to decide the eventual consistency mechanism. Note that some databases let the application choose between availability and consistency per use case, so the decision does not need to be across the entire application logic.
In an upcoming post we will continue the discussion of NoSQL versus SQL by talking about scalability and transaction support. Stay tuned!