Blog/Beyond CAP: Analyzing Systems with the PACELC Theorem
distributed-systemscap-theorempacelcconsistency

Beyond CAP: Analyzing Systems with the PACELC Theorem

May 23, 2024·11 min read·by Bishwambhar Sen
A system design grid mapping database engines along Availability-Consistency-Latency axes under partitioned and normal operations.

Concept

For over two decades, the CAP Theorem (Consistency, Availability, Partition Tolerance) has been the dominant framework for reasoning about distributed database trade-offs. Popularized by Eric Brewer, it asserts that in the presence of a network Partition (P), a system must choose between Consistency (C) (returning errors or blocking to avoid stale data) and Availability (A) (responding to queries with the local, potentially stale state).

However, the CAP theorem has a significant limitation: it only describes system behavior when a network partition occurs. Network partitions are rare in well-managed datacenters. For $99.9%$ of a database’s operational lifetime, the network is functioning normally. CAP remains silent about the trade-offs architects must make during these normal operating conditions.

To bridge this gap, Daniel Abadi formulated the PACELC Theorem in 2012. It extends CAP by incorporating the cost of normal operations:

$$\text{If } \mathbf{P} \text{ (Partition), how does the system choose between } \mathbf{A} \text{ and } \mathbf{C}?$$ $$\text{Else } (\mathbf{E}), \text{ how does the system choose between } \mathbf{L} \text{ (Latency) and } \mathbf{C} \text{ (Consistency)}?$$

                   /--- Yes (P) ---> Choose Availability (A) or Consistency (C)
Network Partition?
                   \--- No (E)  ---> Choose Latency (L) or Consistency (C)

PACELC classifies databases into four primary categories based on how they handle partitions and normal operations:

1. PC/EC (Partition Consistency / Else Consistency)

These systems prioritize consistency at all times. If a partition occurs, they block writes to ensure data does not diverge (Consistency over Availability). Under normal operation, they use synchronous replication and consensus rounds to ensure reads always reflect the latest write (Consistency over Latency).

  • Examples: Google Spanner, CockroachDB, traditional RDBMSs with synchronous replication.

2. PA/EL (Partition Availability / Else Latency)

These systems prioritize availability and speed. If partitioned, they allow nodes to accept writes independently (Availability over Consistency), resolving conflicts later. Under normal operation, they replicate data asynchronously in the background, allowing reads and writes to return immediately without waiting for other nodes (Latency over Consistency).

  • Examples: Apache Cassandra, Amazon DynamoDB, Couchbase.

3. PC/EL (Partition Consistency / Else Latency)

These systems maintain consistency if the network splits by locking out minority partitions. However, under normal operation, they prioritize read/write speed by serving data from local nodes asynchronously, accepting the risk of serving stale reads.

  • Examples: MongoDB (with default write concern w:1 and primary reads), Redis (with replication).

4. PA/EC (Partition Availability / Else Consistency)

These systems attempt to maintain strict consistency under normal operation, but if a partition occurs, they fall back to allowing partitioned nodes to remain active, sacrificing consistency to preserve availability. This category is rare because the architectural complexity of maintaining consensus under normal operations is rarely paired with a willingness to allow silent data divergence during partitions.


Constraints

The PACELC trade-offs are governed by hard physical and network constraints:

The Speed of Light and Network Latency

The theoretical minimum latency for data to travel through fiber optic cable is approximately $1$ millisecond per $200$ kilometers. In a multi-region deployment spanning San Francisco to New York (approx. $4,100$ km), the round-trip time (RTT) cannot physically drop below $41$ milliseconds.

If a database chooses Else Consistency (EC), every write must be synchronously acknowledged by a majority of replicas across regions. The write latency is bounded by the network RTT of the closest quorum: $$\text{Latency}{\text{write}} \ge \text{RTT}{\text{quorum}}$$

If the database chooses Else Latency (EL), the write returns immediately after writing to the local node ($1\text{ ms}$), and the data propagates asynchronously: $$\text{Latency}_{\text{write}} \approx \text{Disk Write Latency}$$

PC/EC Write Path (Consensus Quorum required):
Client ---> Node A (Leader) -- (WAN RTT ~40ms) --> Node B (Follower) -- Acknowledge --> Node A ---> Client Success (Total ~42ms)

PA/EL Write Path (Asynchronous Replication):
Client ---> Node A (Leader) ---> Client Success (Total ~2ms)
           Node A -- (Asynchronous WAN replication in background) --> Node B (Follower)

Quorum Math (R + W > N)

In configurable distributed databases like Cassandra, consistency is controlled mathematically. Let $N$ be the replication factor, $W$ the number of nodes that must acknowledge a write, and $R$ the number of nodes that must respond to a read.

  • To guarantee strong consistency (EC): $$R + W > N$$ This formula ensures that the read set and the write set overlap by at least one node, guaranteeing the client reads the latest write. However, this configuration requires querying multiple nodes, increasing read latency to the speed of the slowest responding node in the quorum.
  • To prioritize latency (EL): $$R + W \le N$$ For example, $R=1, W=1$ with $N=3$. Writes are fast, and reads are fast, but they can return stale data.

Latching and Lock Contention

In a PC/EC system, maintaining consistency during concurrent updates requires distributed lock managers or consensus-level latching. As concurrency increases, threads block waiting for locks to release, causing latency to degrade exponentially (violating strict latency constraints).


Trade-offs

Selecting a database and configuring its replication settings is a direct exercise in PACELC alignment:

Database Default PACELC Configured PACELC Latency SLA Consistency Guarantee
Google Spanner PC/EC Cannot be altered High ($10\text{--}50\text{ ms}$ writes) External Consistency (Serializable) via TrueTime API.
Cassandra PA/EL PC/EC (using QUORUM / ALL consistency) Configurable ($2\text{ ms}$ default, $50\text{ ms}$ under quorum WAN) Tunable consistency. Eventual consistency default.
MongoDB PC/EL PC/EC (using majority write and read concern) Configurable ($1\text{ ms}$ default, $35\text{ ms}$ under majority WAN) Session consistency or eventual consistency.
Amazon DynamoDB PA/EL PC/EC (using strongly consistent reads) Low ($2\text{--}10\text{ ms}$) Strongly consistent reads double read unit costs.

Architectural Decision Tree

When choosing a database configuration:

  1. Does the business domain require absolute financial consistency?

    • Yes: Adopt PC/EC (e.g., Spanner, CockroachDB). Accept the latency overhead of multi-phase consensus.
    • No: Proceed to step 2.
  2. Is low read latency (under 10ms) the primary metric for user experience?

    • Yes: Adopt PA/EL (Cassandra, DynamoDB with eventually consistent reads). Handle stale data and conflicting writes (CRDTs, Last-Write-Wins) in the application layer.
    • No: Adopt PC/EL (MongoDB, local primary read/write). Protect writes from split-brain scenarios but allow reads to be scaled out to local secondaries to optimize latency.

Code

The following example shows how to configure a database client in C# dynamically to swap between PA/EL and PC/EC behaviors. It demonstrates querying MongoDB using different read preferences and write concerns to match the PACELC requirements of different application paths.

using System;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

namespace PacelcAnalysis
{
    public class OrderRecord
    {
        public string OrderId { get; set; } = null!;
        public string CustomerId { get; set; } = null!;
        public decimal Amount { get; set; }
        public DateTime Timestamp { get; set; }
    }

    public class DatabaseClusterClient
    {
        private readonly IMongoClient _client;
        private readonly IMongoDatabase _database;

        public DatabaseClusterClient(string connectionString, string dbName)
        {
            // Establish the connection pool to the distributed replica set
            _client = new MongoClient(connectionString);
            _database = _client.GetDatabase(dbName);
        }

        /// <summary>
        /// Executes a write and read with PC/EC settings:
        /// - Writes must be acknowledged by a majority of replicas.
        /// - Reads must execute directly on the primary node.
        /// Prioritizes Consistency over Latency during normal operations, and Consistency over Availability during partitions.
        /// </summary>
        public async Task ExecuteStronglyConsistentTransactionAsync(OrderRecord order, CancellationToken token)
        {
            // 1. Configure the collection for PC/EC (Strong Consistency)
            var stronglyConsistentCollection = _database.GetCollection<OrderRecord>("orders")
                .WithWriteConcern(WriteConcern.WMajority) // Wait for majority replica write confirmation
                .WithReadPreference(ReadPreference.Primary); // Read only from primary to guarantee no stale reads

            Console.WriteLine("[PC/EC Mode] Initiating transaction. Requiring majority acknowledgment...");
            
            var startTime = DateTime.UtcNow;
            
            // 2. Perform write
            await stronglyConsistentCollection.InsertOneAsync(order, new InsertOneOptions(), token);
            
            // 3. Read back immediately
            var filter = Builders<OrderRecord>.Filter.Eq(o => o.OrderId, order.OrderId);
            var result = await stronglyConsistentCollection.Find(filter).SingleOrDefaultAsync(token);
            
            var durationMs = (DateTime.UtcNow - startTime).TotalMilliseconds;
            Console.WriteLine($"[PC/EC Mode] Complete. Duration: {durationMs:F2}ms. Amount Verified: {result?.Amount}");
        }

        /// <summary>
        /// Executes a write and read with PA/EL settings:
        /// - Writes are acknowledged by a single local replica node and returned immediately.
        /// - Reads can be served from the nearest secondary node.
        /// Prioritizes Latency over Consistency during normal operations, and Availability over Consistency during partitions.
        /// </summary>
        public async Task ExecuteLatencyOptimizedTransactionAsync(OrderRecord order, CancellationToken token)
        {
            // 1. Configure the collection for PA/EL (Latency / Availability Optimized)
            var latencyOptimizedCollection = _database.GetCollection<OrderRecord>("orders")
                .WithWriteConcern(WriteConcern.W1) // Acknowledged by leader only
                .WithReadPreference(ReadPreference.Nearest); // Read from nearest secondary node to minimize network RTT

            Console.WriteLine("[PA/EL Mode] Initiating transaction. Returning on single-node write...");
            
            var startTime = DateTime.UtcNow;

            // 2. Perform write
            await latencyOptimizedCollection.InsertOneAsync(order, new InsertOneOptions(), token);

            // 3. Read back immediately (might return null or older version if replication lag is higher than read RTT)
            var filter = Builders<OrderRecord>.Filter.Eq(o => o.OrderId, order.OrderId);
            
            OrderRecord? result = null;
            int readAttempts = 0;
            
            while (result == null && readAttempts < 5)
            {
                readAttempts++;
                result = await latencyOptimizedCollection.Find(filter).SingleOrDefaultAsync(token);
                if (result == null)
                {
                    Console.WriteLine($"[PA/EL Mode] Read attempt #{readAttempts} returned stale data (Not yet replicated).");
                    await Task.Delay(10, token); // Small sleep before retry
                }
            }

            var durationMs = (DateTime.UtcNow - startTime).TotalMilliseconds;
            Console.WriteLine($"[PA/EL Mode] Complete. Duration: {durationMs:F2}ms. Read attempts: {readAttempts}. Amount Verified: {result?.Amount}");
        }
    }
}