The CAP theorem, also known as Brewer’s theorem, states that a distributed data store cannot simultaneously provide more than two of the following three guarantees: Consistency, Availability, and Partition tolerance.
This theorem is important for architects designing distributed systems because it forces a conscious trade-off between these critical properties. Let’s examine each aspect:
Consistency, in the context of the CAP theorem, means that all nodes in the system present the same view of the data at any given time. Once a write operation completes, every subsequent read operation returns the updated value; there is no stale data.
A simple example: Imagine a banking system. Consistency ensures that if you transfer money from one account to another, all nodes in the system reflect the updated balances immediately. Any inconsistency could lead to severe financial problems.
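To make that concrete, here is a minimal sketch of synchronous replication, where a write is acknowledged only after every replica has applied it, so any subsequent read from any node returns the new value. The `Replica` and `ConsistentStore` classes below are purely illustrative, not part of any real banking stack.

```python
# Minimal sketch of a strongly consistent write path (illustrative only):
# the write is applied to every replica before it is acknowledged.

class Replica:
    """One node's copy of the data."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class ConsistentStore:
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Apply the write synchronously to every replica before acknowledging.
        for replica in self.replicas:
            replica.apply(key, value)
        return True  # acknowledged only once all replicas hold the new value

    def read(self, key):
        # Any replica can answer; they all hold the same value.
        return self.replicas[0].read(key)


replicas = [Replica() for _ in range(3)]
store = ConsistentStore(replicas)
store.write("alice_balance", 900)
assert all(r.read("alice_balance") == 900 for r in replicas)  # no stale reads
```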
Illustrative Example (Diagram):
graph LR
    A[Client] --> B[Write to Node 1]
    B --> C[Network Partition]
    C --> D[Nodes Consistent]
    C --> E[Data Inconsistent]
    B --> F[Read from Node 2]
    F -.-> G[Inconsistency]
    style G fill:#f9f,stroke:#333,stroke-width:2px
This diagram shows how a network partition can lead to inconsistent data if the system doesn’t handle it correctly.
Availability means that the system remains operational and responsive: every request received by a non-failing node gets a response, even if that response does not reflect the most recent write. This is vital for systems that require high uptime, even at the cost of some data consistency.
Continuing the banking example: High availability ensures that customers can always access their accounts and perform transactions, even if a temporary network glitch affects a small part of the system. While the system aims for consistency, temporarily inconsistent data might be tolerated for uninterrupted service.
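As a rough sketch of what “every request gets a response” can look like in code, the client simply accepts the first answer from any reachable node, even if that node holds slightly stale data. The `Replica` class, `UnreachableNode` exception, and `read_any` helper below are hypothetical names, not a real client library.

```python
# Availability-first read sketch (illustrative names): accept the first answer
# from any reachable replica, even if it is stale.

class UnreachableNode(Exception):
    pass


class Replica:
    def __init__(self, data, reachable=True):
        self.data = data
        self.reachable = reachable

    def read(self, key):
        if not self.reachable:
            raise UnreachableNode()
        return self.data.get(key)


def read_any(replicas, key):
    # Walk the replica list and return the first successful response.
    for replica in replicas:
        try:
            return replica.read(key)
        except UnreachableNode:
            continue
    raise RuntimeError("no replica reachable")


replicas = [
    Replica({"alice_balance": 900}, reachable=False),  # cut off by a glitch
    Replica({"alice_balance": 1000}),                  # stale, but reachable
]
print(read_any(replicas, "alice_balance"))  # 1000: stale, yet the request is served
```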
Partition tolerance addresses the reality of distributed systems – networks fail. This guarantee means that the system continues to function even if parts of the network become disconnected. This is arguably the most important guarantee of the three in a distributed setting, as network partitions are inevitable.
In our banking system: If a network partition separates geographically distinct data centers, partition tolerance ensures that each data center continues to operate independently, serving its customers. Reconciliation of data across partitions happens later when the network is restored.
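One simple (and simplistic) way to reconcile once the partition heals is last-write-wins based on timestamps; real systems often use vector clocks, CRDTs, or application-level merge rules instead. The sketch below only shows the shape of the idea.

```python
# Last-write-wins reconciliation sketch (a simplification; real systems often
# use vector clocks or CRDTs). Each data center keeps accepting timestamped
# writes during the partition; the newer value per key wins once it heals.

import time

def write(store, key, value):
    store[key] = (value, time.time())  # value paired with a wall-clock timestamp

def reconcile(store_a, store_b):
    merged = {}
    for key in store_a.keys() | store_b.keys():
        a = store_a.get(key, (None, float("-inf")))
        b = store_b.get(key, (None, float("-inf")))
        merged[key] = a if a[1] >= b[1] else b  # the later write wins
    return merged

dc_east, dc_west = {}, {}
write(dc_east, "alice_balance", 900)  # accepted independently during the partition
write(dc_west, "alice_balance", 950)  # accepted independently during the partition
print(reconcile(dc_east, dc_west)["alice_balance"][0])  # 950: the later write
```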
The core of the CAP theorem lies in the fact that you can’t have all three guarantees simultaneously. The choice depends entirely on the application’s requirements:
CA (Consistency and Availability): This combination is only achievable when network partitions can be ruled out, which in practice means a single node or a tightly coupled cluster, and therefore a single point of failure that violates the spirit of a distributed system. It suits small systems where partitions are unlikely and consistency is paramount; one potential approach is a leader election algorithm that routes all operations through a single active node.
CP (Consistency and Partition Tolerance): This is the choice for systems prioritizing data consistency over availability. During a network partition, some nodes may become unavailable to maintain data consistency. Examples include database systems that utilize distributed consensus algorithms like Paxos or Raft.
AP (Availability and Partition Tolerance): This approach sacrifices consistency for availability. During network partitions, nodes might continue to operate independently, potentially resulting in temporary inconsistencies. This is often seen in systems like DNS and NoSQL databases. Techniques like eventual consistency are employed.
Illustrating practical implementation details in a short blog post is difficult because such systems are inherently complex. However, we can sketch the core concepts:
CP System (Conceptual):
Imagine a simple key-value store using a distributed consensus algorithm like Paxos:
class DistributedKVStore:
    def __init__(self, nodes):
        self.nodes = nodes  # List of node addresses

    def put(self, key, value):
        # Use Paxos to ensure a consistent write across the nodes
        # ... Paxos implementation ...
        pass

    def get(self, key):
        # Read from a majority of nodes to ensure consistency
        # ... Paxos implementation ...
        pass
This code hints at the use of a distributed consensus algorithm to guarantee consistency even during a partition (at the cost of availability during the partition).
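To show the availability cost more concretely, here is a simplified majority-quorum sketch. This is not Paxos itself, and the `Node` and `QuorumKVStore` names are illustrative: an operation only succeeds when a majority of nodes participate, so the minority side of a partition refuses requests rather than risk serving stale data.

```python
# Simplified majority-quorum sketch (not Paxos itself; illustrative only).
# Writes and reads both require a majority, so the minority side of a
# partition rejects requests instead of serving possibly stale data.

class Node:
    def __init__(self):
        self.data = {}
        self.reachable = True


class QuorumKVStore:
    def __init__(self, nodes):
        self.nodes = nodes
        self.quorum = len(nodes) // 2 + 1

    def _live_nodes(self):
        return [n for n in self.nodes if n.reachable]

    def put(self, key, value):
        live = self._live_nodes()
        if len(live) < self.quorum:
            raise RuntimeError("no quorum: write refused (consistency over availability)")
        for node in live:
            node.data[key] = value

    def get(self, key):
        live = self._live_nodes()
        if len(live) < self.quorum:
            raise RuntimeError("no quorum: read refused")
        # Simplification: a real quorum read would compare versions across nodes.
        return live[0].data.get(key)


nodes = [Node() for _ in range(3)]
store = QuorumKVStore(nodes)
store.put("x", 1)
nodes[1].reachable = False
nodes[2].reachable = False       # a partition isolates two of the three nodes
# store.put("x", 2) would now raise: the store stays consistent but unavailable
```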
AP System (Conceptual):
A simple key-value store with eventual consistency:
class EventualKVStore:
    def __init__(self, nodes):
        self.nodes = nodes     # List of node addresses
        self.local_store = {}  # This node's local copy of the data

    def put(self, key, value, node_id):
        # Write to the local node and acknowledge immediately
        self.local_store[key] = value
        # Eventually propagate to other nodes asynchronously
        # ... Asynchronous replication ...

    def get(self, key, node_id):
        # Read from the local node; may be stale until replication completes
        return self.local_store.get(key)
Here, data is written locally, and eventually propagated – leading to availability but potential inconsistency until propagation is complete.
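A quick usage sketch, with two `EventualKVStore` instances standing in for two nodes and arbitrary `node_id` values, shows the trade-off: node 1 acknowledges the write immediately, while node 2 still returns stale data until the elided asynchronous replication catches up.

```python
# Usage sketch: two EventualKVStore instances stand in for two nodes.
# Replication is elided above, so node 2 is stale right after the write.

node1 = EventualKVStore(nodes=["node1", "node2"])
node2 = EventualKVStore(nodes=["node1", "node2"])

node1.put("alice_balance", 900, node_id="node1")    # acknowledged immediately
print(node1.get("alice_balance", node_id="node1"))  # 900
print(node2.get("alice_balance", node_id="node2"))  # None: not yet replicated
```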