Theoretical Foundations

Welcome to the curriculum workspace. Here you will find long-form technical guidelines outlining core architectural blueprints and implementation mechanics.

Module 5: Storage Paradigms & Database Mechanics

PHASE 2 — SYSTEM DESIGN FOUNDATIONS: Your application logic is only as fast as the data store beneath it. This module gives you the mechanical knowledge to choose the right database paradigm, design efficient indexes, interpret query plans, and architect replication and sharding topologies for scale. These fundamentals underpin Modules 6 (Caching), 11 (Sagas), and 12 (CQRS).

Introduction: The Data Persistence Selection Problem

There are hundreds of database products on the market. Every engineering team faces the same decision: which storage engine is right for this workload? The wrong choice can result in:

Write bottlenecks (choosing a read-optimized B-Tree index for a write-heavy log ingestion pipeline).
Query impossibility (choosing a key-value store for a workload that requires full-text search and aggregation).
Operational complexity (choosing a distributed wide-column store when a single-node PostgreSQL would have served the workload for 5 years without changes).

The correct framework for database selection starts with understanding the mechanical internals of how each storage engine works.

Section 1: How Databases Store Data

A. B-Tree Indexes (Read-Optimized)

B-Trees (Balanced Trees) are the foundational data structure behind most relational databases (PostgreSQL, MySQL, SQLite). A B-Tree is a self-balancing tree where each node can have multiple children:

                    [Root Node]
                    [30 | 70]
                   /    |    \
              [10|20]  [40|60]  [80|90]
             /  |  \   /  |  \   /  |  \
          [5] [15] [25][35][50][65][75][85][95]
          (Leaf nodes contain actual row pointers to disk)

B-Tree Mechanics:

Each node is sized to match the OS page size (typically 8KB or 16KB).
Finding a record requires traversing $O(\log_B N)$ nodes, where $B$ is the branching factor and $N$ is the total number of records.
For 1 billion records with $B=100$: depth = $\log_{100}(10^9) \approx 4.5$ — only 5 disk reads to find any record.

Write Path: Inserting a new value requires finding the correct leaf node and potentially rebalancing the tree (split operations). B-Tree writes are in-place updates — they overwrite existing disk blocks.

Composite Indexes and Index-Only Scans

A composite index on (user_id, created_at) stores keys sorted first by user_id, then by created_at:

-- This query can use the composite index efficiently
SELECT id, total FROM orders
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 20;

-- Execution plan: Index Scan using orders_user_created_idx
-- Index Cond: (user_id = 12345)
-- Index pages read: 3 (instead of full 50,000-row table scan)

An Index-Only Scan occurs when the query only needs columns that exist in the index — no heap access required:

-- Composite index on (user_id, status) covers this query entirely
SELECT COUNT(*) FROM orders WHERE user_id = 12345 AND status = 'completed';
-- Index Only Scan: heap fetches = 0

B. LSM-Trees (Write-Optimized)

Log-Structured Merge-Trees (LSM-Trees) are used by write-heavy databases: Apache Cassandra, RocksDB, LevelDB, ClickHouse.

Write Path:

Write Request
    |
    v
[MemTable] (in-memory sorted buffer)
    |
    | (when MemTable reaches ~64MB threshold)
    v
[SSTable 0] (immutable sorted file on disk)
    |
    | (background compaction: merge SSTables)
    v
[SSTable 1] + [SSTable 2] → [SSTable merged]

All writes go to an in-memory MemTable (a balanced tree like a Red-Black Tree).
When the MemTable exceeds a threshold, it is flushed to disk as an immutable SSTable (Sorted String Table).
Background compaction merges multiple SSTables to reclaim space and remove tombstoned deletes.

Read Path: Must check the MemTable + all SSTables (most recent wins). This is mitigated by Bloom Filters — probabilistic data structures that can definitively say "this key does NOT exist in this SSTable," reducing unnecessary disk reads.

Feature	B-Tree	LSM-Tree
Write Performance	Moderate (in-place updates, seek overhead)	Excellent (sequential appends)
Read Performance	Excellent (direct lookup via tree traversal)	Good (Bloom filter + compaction)
Space Amplification	Low	High (multiple versions until compaction)
Write Amplification	Moderate	High (data written multiple times during compaction)
Best For	OLTP reads, range queries	High-throughput writes, time-series, logs
Examples	PostgreSQL, MySQL, SQLite	Cassandra, RocksDB, LevelDB, ClickHouse

Section 2: SQL Internals — ACID & Transaction Isolation

A. ACID Properties

Every relational database guarantees ACID properties for transactions:

Property	Definition	Implementation Mechanism
Atomicity	All operations in a transaction succeed or all are rolled back	Write-Ahead Log (WAL)
Consistency	The database transitions from one valid state to another	Constraints, Foreign Keys, Triggers
Isolation	Concurrent transactions do not interfere with each other	MVCC (Multi-Version Concurrency Control)
Durability	Committed data survives crashes	WAL flushed to durable storage before commit confirmation

B. Multi-Version Concurrency Control (MVCC)

PostgreSQL uses MVCC to allow readers and writers to operate concurrently without blocking each other. Instead of locking rows, MVCC stores multiple versions of each row:

Row: user_id=42, name="Alice", xmin=100, xmax=null  (current version)
                                                      (written by transaction 100)

Transaction 101: UPDATE users SET name="Alicia" WHERE id=42
→ Creates NEW version: name="Alicia", xmin=101, xmax=null
→ Marks OLD version: xmin=100, xmax=101 (now expired)

Concurrent Transaction 99 (started before 101):
→ Reads version where xmin=100 (the old "Alice" row)
→ Does NOT see the "Alicia" update (snapshot isolation)

C. Transaction Isolation Levels

PostgreSQL supports four isolation levels that trade consistency for concurrency:

Isolation Level	Dirty Read	Non-Repeatable Read	Phantom Read	Use Case
Read Uncommitted	Possible	Possible	Possible	Analytics (rare)
Read Committed	Not possible	Possible	Possible	Default (most OLTP)
Repeatable Read	Not possible	Not possible	Possible	Financial reports
Serializable	Not possible	Not possible	Not possible	Critical financial transactions

-- Set isolation level for a transaction
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  SELECT balance FROM accounts WHERE id = 42;
  UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;

Section 3: Query Plan Analysis — EXPLAIN ANALYZE

Before adding indexes or replicas, diagnose slow queries with EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > NOW() - INTERVAL '24 hours'
ORDER BY o.created_at DESC
LIMIT 100;

Output interpretation:

Sort  (cost=1234.56..1235.06 rows=100)  (actual time=45.123..45.145 rows=100)
  Sort Key: o.created_at DESC
  ->  Hash Join  (cost=450.00..1230.00 rows=200)  (actual time=12.345..44.678 rows=100)
        Hash Cond: (o.customer_id = c.id)
        ->  Index Scan on orders_status_created_idx  (cost=0.43..780.56 rows=200)
              Index Cond: ((status = 'pending') AND (created_at > ...))
        ->  Hash  (cost=200.00..200.00 rows=20000)  (actual time=8.456..8.456 rows=20000)
              ->  Seq Scan on customers  (cost=0.00..200.00 rows=20000)

Reading the plan:

Seq Scan on customers — full table scan on 20,000 customer rows. If customers grows to 2 million rows, this becomes the bottleneck.
Index Scan on orders_status_created_idx — efficient composite index usage.
actual time=45.123ms — total execution time.

Optimization: Add an index on customers.id (it may already be the PK) and ensure the orders composite index on (status, created_at) exists.

CREATE INDEX CONCURRENTLY orders_status_created_idx
ON orders(status, created_at DESC)
WHERE status IN ('pending', 'processing');
-- Partial index: only index relevant statuses, reducing index size by ~80%

Section 4: NoSQL Database Categories

A. Document Databases (MongoDB, CouchDB)

Store data as JSON/BSON documents. Schema-flexible — each document can have different fields.

Best for: Content management, user profiles, catalogs with variable attributes. Query pattern: Rich queries on nested document fields.

// MongoDB: store and query a user profile with variable fields
db.users.insertOne({
  _id: "user-123",
  name: "Alice Chen",
  role: "architect",
  certifications: ["MPC-Professional", "AWS-SA"],
  preferences: { theme: "dark", notifications: { email: true, push: false } }
});

db.users.find({
  "certifications": "MPC-Professional",
  "preferences.notifications.email": true
});

Limitation: No multi-document ACID transactions (available in MongoDB 4.0+ but expensive). Poor for complex relational queries with joins.

B. Key-Value Stores (Redis, DynamoDB)

The simplest database model: a hash map. Every value is retrieved by its exact key.

Best for: Session storage, caching, leaderboards, rate limiting. Query pattern: GET key, SET key value, DEL key. No query language.

# Redis: session storage
SET session:abc123 '{"userId":"user-42","role":"admin"}' EX 3600
GET session:abc123
TTL session:abc123  # → 3599

# Redis Sorted Set: real-time leaderboard
ZADD leaderboard 94850 "alice"
ZADD leaderboard 87340 "bob"
ZREVRANGE leaderboard 0 9 WITHSCORES  # Top 10 players

C. Wide-Column Stores (Apache Cassandra, HBase)

Data is organized in tables with rows and columns, but columns are dynamic per row — different rows can have different columns. Designed for massive write throughput and horizontal partitioning.

Best for: Time-series data, IoT sensor streams, user activity logs at petabyte scale.

Cassandra table: sensor_readings
Partition Key: device_id (determines which node stores this data)
Clustering Key: timestamp (sorts within partition)

device_id | timestamp           | temperature | battery | location
----------------------------------------------------------------------
sensor-1  | 2024-01-01 12:00:00 | 22.5        | 87%     | 37.7,122.4
sensor-1  | 2024-01-01 12:00:01 | 22.6        | 87%     | 37.7,122.4
sensor-2  | 2024-01-01 12:00:00 | 19.1        | 43%     | 51.5,-0.1

D. Graph Databases (Neo4j, Amazon Neptune)

Stores data as nodes (entities) and edges (relationships). Optimized for traversing highly connected data.

Best for: Social networks (friends-of-friends), recommendation engines, fraud detection, knowledge graphs.

// Neo4j Cypher: find friends-of-friends who share interests
MATCH (user:User {id: 'alice'})-[:FOLLOWS]->(friend:User)-[:FOLLOWS]->(fof:User)
WHERE NOT (user)-[:FOLLOWS]->(fof)
AND (fof)-[:INTERESTED_IN]->(:Topic {name: 'Systems Architecture'})
RETURN fof.name, COUNT(*) as mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10

SQL would require: Multiple self-joins with extremely high complexity for traversal beyond 2–3 hops.

Section 5: Replication Topologies

A. Single-Leader Replication

All writes go to a designated leader node. The leader replicates to follower nodes asynchronously.

graph TD
    W[Write Client] -->|Write| Leader[(Leader DB)]
    Leader -->|Async WAL stream| F1[(Follower 1)]
    Leader -->|Async WAL stream| F2[(Follower 2)]
    R1[Read Client A] -->|Read| F1
    R2[Read Client B] -->|Read| F2

Failover: If the leader fails, a follower is promoted via a consensus protocol (Raft). This process can take 30–60 seconds and may involve data loss (writes acknowledged by leader but not yet replicated to followers).

Replication Lag Formula: $$\text{Lag} = T_{\text{apply on follower}} - T_{\text{commit on leader}}$$ For busy leaders with high write volume, this lag can reach seconds or minutes, causing stale reads from followers.

B. Multi-Leader Replication

Multiple nodes accept writes, each acting as both a leader and a follower to other leaders.

Data Center A                    Data Center B
[Leader A] ←————async————→ [Leader B]
    ↑                              ↑
[Local Clients]            [Local Clients]

Write Conflict Problem: Alice updates config.maxUsers = 100 on Leader A. Bob simultaneously updates config.maxUsers = 200 on Leader B. When replicated, both leaders have divergent values. Last-Write-Wins (LWW) based on timestamp requires synchronized clocks — dangerous across data centers. See Module 7 (CAP Theorem) for resolution strategies.

C. Leaderless Replication (Dynamo-Style)

No single node is the master. Clients write to and read from multiple nodes in parallel using quorums.

$$W + R > N$$

Where $W$ = write quorum, $R$ = read quorum, $N$ = total replica count.

Example: $N=3, W=2, R=2$: Write must succeed on 2 of 3 replicas. Read must query 2 of 3 replicas. Guaranteed overlap ensures at least one node has the latest write.

Configuration	Consistency Guarantee	Availability
$W=3, R=1$ (N=3)	Strong read consistency	Low — all must ack writes
$W=1, R=3$ (N=3)	Strong read consistency	Low — all must be queried
$W=2, R=2$ (N=3)	Balanced	High — tolerates 1 node failure
$W=1, R=1$ (N=3)	No guarantee (eventual)	Very high

Section 6: Database Sharding

Sharding (horizontal partitioning) distributes rows across multiple independent database nodes. Each node is called a shard.

A. Range-Based Sharding

Shard 1: user_id 1       – 10,000,000
Shard 2: user_id 10,000,001 – 20,000,000
Shard 3: user_id 20,000,001 – 30,000,000

Problem: If most active users have IDs in the 10M–20M range (recently acquired users), Shard 2 receives the majority of traffic while Shards 1 and 3 sit idle. This is called a hot shard.

B. Hash-Based Sharding

shard_id = hash(user_id) % total_shards

hash("user-12345") % 4 = 1 → Shard 1
hash("user-67890") % 4 = 3 → Shard 3

Problem: Adding a new shard requires re-hashing almost all data. Increasing from 4 to 5 shards means: $$\text{Keys remapped} = \frac{N-1}{N} \approx 80%$$

Only ~20% of keys land on the same shard as before. This requires a massive data migration.

C. Consistent Hashing (Elastic Sharding)

Consistent Hashing maps both data keys and server nodes onto a circular ring (0 to $2^{32}$). A key is assigned to the nearest clockwise node:

              Key K5 (hash=280)
                    |
    Node C (300) ←--+  ← K5 assigned to Node C
   /
Ring: 0 ——— Node A (100) ——— Key K1 (150) ——— Node B (200) ——— Key K3 (250) ——— Node C (300) ——— 360/0

Adding a new node: Only keys between the new node and its predecessor need to move. With $N$ nodes, adding one moves only $\frac{1}{N+1}$ of total keys — a minimal migration.

Virtual Nodes: Each physical server is assigned multiple positions on the ring (virtual nodes), distributing load more evenly and allowing fine-grained rebalancing when nodes join or leave.

Section 7: Connection Pooling with pgBouncer

Every PostgreSQL connection consumes ~5-10MB of server memory and spawns a dedicated OS process. An application server with 100 concurrent queries needs 100 database connections — at 10MB each, that's 1GB of PostgreSQL memory just for connection overhead.

pgBouncer is a connection pooler that maintains a small pool of long-lived database connections, multiplexing thousands of application-side connections:

[App Server 1: 50 worker threads]
[App Server 2: 50 worker threads]   → [pgBouncer] → [PostgreSQL: 20 connections]
[App Server 3: 50 worker threads]

150 application connections → multiplexed to → 20 database connections

pgBouncer configuration:

[databases]
macropatterns_prod = host=db.internal port=5432 dbname=mpc_production

[pgbouncer]
pool_mode = transaction   # Connection held only during a transaction
max_client_conn = 1000    # Max app-side connections
default_pool_size = 20    # Max backend DB connections
min_pool_size = 5
reserve_pool_size = 5

In transaction pool mode, a connection is returned to the pool the moment a transaction commits, allowing 20 database connections to serve thousands of concurrent requests.

Section 8: Hands-On ERD Challenge

The Challenge

Design an Entity-Relationship Diagram showing a sharded order history database. The diagram must show:

A user_shard_map lookup table mapping user_id to shard_id.
A USERS table with a shard_id column for co-location.
An ORDERS table with a foreign key to USERS and its own shard_id.
Crow's foot notation showing cardinality.

Solution Model

erDiagram
    USER_SHARD_MAP {
        string user_id PK
        int    shard_id
        string region
    }

    USERS {
        string  user_id     PK
        int     shard_id    FK
        string  email
        string  name
        string  plan
        timestamp created_at
    }

    ORDERS {
        string    order_id    PK
        string    user_id     FK
        int       shard_id
        decimal   total
        string    status
        timestamp created_at
    }

    ORDER_ITEMS {
        string  item_id   PK
        string  order_id  FK
        string  product_id
        int     quantity
        decimal unit_price
    }

    USER_SHARD_MAP ||--o{ USERS : "routes to shard"
    USERS ||--o{ ORDERS : "places"
    ORDERS ||--|{ ORDER_ITEMS : "contains"

Key architectural decisions:

shard_id is denormalized onto both USERS and ORDERS to enable shard-local JOINs without cross-shard queries.
user_shard_map is a small, globally replicated lookup table (hot data), allowing any application server to route a request to the correct shard.
ORDER_ITEMS are co-located on the same shard as their parent order — no cross-shard joins needed.

Bridge to Module 6: Now that you can store data efficiently, you need to protect your database from being hit on every read request. Module 6 (Caching Topologies) covers the layered caching strategies — from browser caches to distributed Redis clusters — that keep your database at acceptable load levels as traffic grows.

Theoretical Foundations

Welcome to the curriculum workspace. Here you will find long-form technical guidelines outlining core architectural blueprints and implementation mechanics.

Module 5: Storage Paradigms & Database Mechanics

PHASE 2 — SYSTEM DESIGN FOUNDATIONS: Your application logic is only as fast as the data store beneath it. This module gives you the mechanical knowledge to choose the right database paradigm, design efficient indexes, interpret query plans, and architect replication and sharding topologies for scale. These fundamentals underpin Modules 6 (Caching), 11 (Sagas), and 12 (CQRS).

Introduction: The Data Persistence Selection Problem

There are hundreds of database products on the market. Every engineering team faces the same decision: which storage engine is right for this workload? The wrong choice can result in:

Write bottlenecks (choosing a read-optimized B-Tree index for a write-heavy log ingestion pipeline).
Query impossibility (choosing a key-value store for a workload that requires full-text search and aggregation).
Operational complexity (choosing a distributed wide-column store when a single-node PostgreSQL would have served the workload for 5 years without changes).

The correct framework for database selection starts with understanding the mechanical internals of how each storage engine works.

Section 1: How Databases Store Data

A. B-Tree Indexes (Read-Optimized)

                    [Root Node]
                    [30 | 70]
                   /    |    \
              [10|20]  [40|60]  [80|90]
             /  |  \   /  |  \   /  |  \
          [5] [15] [25][35][50][65][75][85][95]
          (Leaf nodes contain actual row pointers to disk)

B-Tree Mechanics:

Each node is sized to match the OS page size (typically 8KB or 16KB).
Finding a record requires traversing $O(\log_B N)$ nodes, where $B$ is the branching factor and $N$ is the total number of records.
For 1 billion records with $B=100$: depth = $\log_{100}(10^9) \approx 4.5$ — only 5 disk reads to find any record.

Composite Indexes and Index-Only Scans

A composite index on (user_id, created_at) stores keys sorted first by user_id, then by created_at:

-- This query can use the composite index efficiently
SELECT id, total FROM orders
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 20;

-- Execution plan: Index Scan using orders_user_created_idx
-- Index Cond: (user_id = 12345)
-- Index pages read: 3 (instead of full 50,000-row table scan)

An Index-Only Scan occurs when the query only needs columns that exist in the index — no heap access required:

-- Composite index on (user_id, status) covers this query entirely
SELECT COUNT(*) FROM orders WHERE user_id = 12345 AND status = 'completed';
-- Index Only Scan: heap fetches = 0

B. LSM-Trees (Write-Optimized)

Log-Structured Merge-Trees (LSM-Trees) are used by write-heavy databases: Apache Cassandra, RocksDB, LevelDB, ClickHouse.

Write Path:

Write Request
    |
    v
[MemTable] (in-memory sorted buffer)
    |
    | (when MemTable reaches ~64MB threshold)
    v
[SSTable 0] (immutable sorted file on disk)
    |
    | (background compaction: merge SSTables)
    v
[SSTable 1] + [SSTable 2] → [SSTable merged]

All writes go to an in-memory MemTable (a balanced tree like a Red-Black Tree).
When the MemTable exceeds a threshold, it is flushed to disk as an immutable SSTable (Sorted String Table).
Background compaction merges multiple SSTables to reclaim space and remove tombstoned deletes.

Feature	B-Tree	LSM-Tree
Write Performance	Moderate (in-place updates, seek overhead)	Excellent (sequential appends)
Read Performance	Excellent (direct lookup via tree traversal)	Good (Bloom filter + compaction)
Space Amplification	Low	High (multiple versions until compaction)
Write Amplification	Moderate	High (data written multiple times during compaction)
Best For	OLTP reads, range queries	High-throughput writes, time-series, logs
Examples	PostgreSQL, MySQL, SQLite	Cassandra, RocksDB, LevelDB, ClickHouse

Section 2: SQL Internals — ACID & Transaction Isolation

A. ACID Properties

Every relational database guarantees ACID properties for transactions:

Property	Definition	Implementation Mechanism
Atomicity	All operations in a transaction succeed or all are rolled back	Write-Ahead Log (WAL)
Consistency	The database transitions from one valid state to another	Constraints, Foreign Keys, Triggers
Isolation	Concurrent transactions do not interfere with each other	MVCC (Multi-Version Concurrency Control)
Durability	Committed data survives crashes	WAL flushed to durable storage before commit confirmation

B. Multi-Version Concurrency Control (MVCC)

PostgreSQL uses MVCC to allow readers and writers to operate concurrently without blocking each other. Instead of locking rows, MVCC stores multiple versions of each row:

Row: user_id=42, name="Alice", xmin=100, xmax=null  (current version)
                                                      (written by transaction 100)

Transaction 101: UPDATE users SET name="Alicia" WHERE id=42
→ Creates NEW version: name="Alicia", xmin=101, xmax=null
→ Marks OLD version: xmin=100, xmax=101 (now expired)

Concurrent Transaction 99 (started before 101):
→ Reads version where xmin=100 (the old "Alice" row)
→ Does NOT see the "Alicia" update (snapshot isolation)

C. Transaction Isolation Levels

PostgreSQL supports four isolation levels that trade consistency for concurrency:

Isolation Level	Dirty Read	Non-Repeatable Read	Phantom Read	Use Case
Read Uncommitted	Possible	Possible	Possible	Analytics (rare)
Read Committed	Not possible	Possible	Possible	Default (most OLTP)
Repeatable Read	Not possible	Not possible	Possible	Financial reports
Serializable	Not possible	Not possible	Not possible	Critical financial transactions

-- Set isolation level for a transaction
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  SELECT balance FROM accounts WHERE id = 42;
  UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;

Section 3: Query Plan Analysis — EXPLAIN ANALYZE

Before adding indexes or replicas, diagnose slow queries with EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > NOW() - INTERVAL '24 hours'
ORDER BY o.created_at DESC
LIMIT 100;

Output interpretation:

Sort  (cost=1234.56..1235.06 rows=100)  (actual time=45.123..45.145 rows=100)
  Sort Key: o.created_at DESC
  ->  Hash Join  (cost=450.00..1230.00 rows=200)  (actual time=12.345..44.678 rows=100)
        Hash Cond: (o.customer_id = c.id)
        ->  Index Scan on orders_status_created_idx  (cost=0.43..780.56 rows=200)
              Index Cond: ((status = 'pending') AND (created_at > ...))
        ->  Hash  (cost=200.00..200.00 rows=20000)  (actual time=8.456..8.456 rows=20000)
              ->  Seq Scan on customers  (cost=0.00..200.00 rows=20000)

Reading the plan:

Seq Scan on customers — full table scan on 20,000 customer rows. If customers grows to 2 million rows, this becomes the bottleneck.
Index Scan on orders_status_created_idx — efficient composite index usage.
actual time=45.123ms — total execution time.

Optimization: Add an index on customers.id (it may already be the PK) and ensure the orders composite index on (status, created_at) exists.

CREATE INDEX CONCURRENTLY orders_status_created_idx
ON orders(status, created_at DESC)
WHERE status IN ('pending', 'processing');
-- Partial index: only index relevant statuses, reducing index size by ~80%

Section 4: NoSQL Database Categories

A. Document Databases (MongoDB, CouchDB)

Store data as JSON/BSON documents. Schema-flexible — each document can have different fields.

Best for: Content management, user profiles, catalogs with variable attributes. Query pattern: Rich queries on nested document fields.

// MongoDB: store and query a user profile with variable fields
db.users.insertOne({
  _id: "user-123",
  name: "Alice Chen",
  role: "architect",
  certifications: ["MPC-Professional", "AWS-SA"],
  preferences: { theme: "dark", notifications: { email: true, push: false } }
});

db.users.find({
  "certifications": "MPC-Professional",
  "preferences.notifications.email": true
});

Limitation: No multi-document ACID transactions (available in MongoDB 4.0+ but expensive). Poor for complex relational queries with joins.

B. Key-Value Stores (Redis, DynamoDB)

The simplest database model: a hash map. Every value is retrieved by its exact key.

Best for: Session storage, caching, leaderboards, rate limiting. Query pattern: GET key, SET key value, DEL key. No query language.

# Redis: session storage
SET session:abc123 '{"userId":"user-42","role":"admin"}' EX 3600
GET session:abc123
TTL session:abc123  # → 3599

# Redis Sorted Set: real-time leaderboard
ZADD leaderboard 94850 "alice"
ZADD leaderboard 87340 "bob"
ZREVRANGE leaderboard 0 9 WITHSCORES  # Top 10 players

C. Wide-Column Stores (Apache Cassandra, HBase)

Best for: Time-series data, IoT sensor streams, user activity logs at petabyte scale.

Cassandra table: sensor_readings
Partition Key: device_id (determines which node stores this data)
Clustering Key: timestamp (sorts within partition)

device_id | timestamp           | temperature | battery | location
----------------------------------------------------------------------
sensor-1  | 2024-01-01 12:00:00 | 22.5        | 87%     | 37.7,122.4
sensor-1  | 2024-01-01 12:00:01 | 22.6        | 87%     | 37.7,122.4
sensor-2  | 2024-01-01 12:00:00 | 19.1        | 43%     | 51.5,-0.1

D. Graph Databases (Neo4j, Amazon Neptune)

Stores data as nodes (entities) and edges (relationships). Optimized for traversing highly connected data.

Best for: Social networks (friends-of-friends), recommendation engines, fraud detection, knowledge graphs.

// Neo4j Cypher: find friends-of-friends who share interests
MATCH (user:User {id: 'alice'})-[:FOLLOWS]->(friend:User)-[:FOLLOWS]->(fof:User)
WHERE NOT (user)-[:FOLLOWS]->(fof)
AND (fof)-[:INTERESTED_IN]->(:Topic {name: 'Systems Architecture'})
RETURN fof.name, COUNT(*) as mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10

SQL would require: Multiple self-joins with extremely high complexity for traversal beyond 2–3 hops.

Section 5: Replication Topologies

A. Single-Leader Replication

All writes go to a designated leader node. The leader replicates to follower nodes asynchronously.

graph TD
    W[Write Client] -->|Write| Leader[(Leader DB)]
    Leader -->|Async WAL stream| F1[(Follower 1)]
    Leader -->|Async WAL stream| F2[(Follower 2)]
    R1[Read Client A] -->|Read| F1
    R2[Read Client B] -->|Read| F2

B. Multi-Leader Replication

Multiple nodes accept writes, each acting as both a leader and a follower to other leaders.

Data Center A                    Data Center B
[Leader A] ←————async————→ [Leader B]
    ↑                              ↑
[Local Clients]            [Local Clients]

C. Leaderless Replication (Dynamo-Style)

No single node is the master. Clients write to and read from multiple nodes in parallel using quorums.

$$W + R > N$$

Where $W$ = write quorum, $R$ = read quorum, $N$ = total replica count.

Example: $N=3, W=2, R=2$: Write must succeed on 2 of 3 replicas. Read must query 2 of 3 replicas. Guaranteed overlap ensures at least one node has the latest write.

Configuration	Consistency Guarantee	Availability
$W=3, R=1$ (N=3)	Strong read consistency	Low — all must ack writes
$W=1, R=3$ (N=3)	Strong read consistency	Low — all must be queried
$W=2, R=2$ (N=3)	Balanced	High — tolerates 1 node failure
$W=1, R=1$ (N=3)	No guarantee (eventual)	Very high

Section 6: Database Sharding

Sharding (horizontal partitioning) distributes rows across multiple independent database nodes. Each node is called a shard.

A. Range-Based Sharding

Shard 1: user_id 1       – 10,000,000
Shard 2: user_id 10,000,001 – 20,000,000
Shard 3: user_id 20,000,001 – 30,000,000

Problem: If most active users have IDs in the 10M–20M range (recently acquired users), Shard 2 receives the majority of traffic while Shards 1 and 3 sit idle. This is called a hot shard.

B. Hash-Based Sharding

shard_id = hash(user_id) % total_shards

hash("user-12345") % 4 = 1 → Shard 1
hash("user-67890") % 4 = 3 → Shard 3

Problem: Adding a new shard requires re-hashing almost all data. Increasing from 4 to 5 shards means: $$\text{Keys remapped} = \frac{N-1}{N} \approx 80%$$

Only ~20% of keys land on the same shard as before. This requires a massive data migration.

C. Consistent Hashing (Elastic Sharding)

Consistent Hashing maps both data keys and server nodes onto a circular ring (0 to $2^{32}$). A key is assigned to the nearest clockwise node:

              Key K5 (hash=280)
                    |
    Node C (300) ←--+  ← K5 assigned to Node C
   /
Ring: 0 ——— Node A (100) ——— Key K1 (150) ——— Node B (200) ——— Key K3 (250) ——— Node C (300) ——— 360/0

Adding a new node: Only keys between the new node and its predecessor need to move. With $N$ nodes, adding one moves only $\frac{1}{N+1}$ of total keys — a minimal migration.

Virtual Nodes: Each physical server is assigned multiple positions on the ring (virtual nodes), distributing load more evenly and allowing fine-grained rebalancing when nodes join or leave.

Section 7: Connection Pooling with pgBouncer

pgBouncer is a connection pooler that maintains a small pool of long-lived database connections, multiplexing thousands of application-side connections:

[App Server 1: 50 worker threads]
[App Server 2: 50 worker threads]   → [pgBouncer] → [PostgreSQL: 20 connections]
[App Server 3: 50 worker threads]

150 application connections → multiplexed to → 20 database connections

pgBouncer configuration:

[databases]
macropatterns_prod = host=db.internal port=5432 dbname=mpc_production

[pgbouncer]
pool_mode = transaction   # Connection held only during a transaction
max_client_conn = 1000    # Max app-side connections
default_pool_size = 20    # Max backend DB connections
min_pool_size = 5
reserve_pool_size = 5

In transaction pool mode, a connection is returned to the pool the moment a transaction commits, allowing 20 database connections to serve thousands of concurrent requests.

Section 8: Hands-On ERD Challenge

The Challenge

Design an Entity-Relationship Diagram showing a sharded order history database. The diagram must show:

A user_shard_map lookup table mapping user_id to shard_id.
A USERS table with a shard_id column for co-location.
An ORDERS table with a foreign key to USERS and its own shard_id.
Crow's foot notation showing cardinality.

Solution Model

erDiagram
    USER_SHARD_MAP {
        string user_id PK
        int    shard_id
        string region
    }

    USERS {
        string  user_id     PK
        int     shard_id    FK
        string  email
        string  name
        string  plan
        timestamp created_at
    }

    ORDERS {
        string    order_id    PK
        string    user_id     FK
        int       shard_id
        decimal   total
        string    status
        timestamp created_at
    }

    ORDER_ITEMS {
        string  item_id   PK
        string  order_id  FK
        string  product_id
        int     quantity
        decimal unit_price
    }

    USER_SHARD_MAP ||--o{ USERS : "routes to shard"
    USERS ||--o{ ORDERS : "places"
    ORDERS ||--|{ ORDER_ITEMS : "contains"

Key architectural decisions:

shard_id is denormalized onto both USERS and ORDERS to enable shard-local JOINs without cross-shard queries.
user_shard_map is a small, globally replicated lookup table (hot data), allowing any application server to route a request to the correct shard.
ORDER_ITEMS are co-located on the same shard as their parent order — no cross-shard joins needed.

Module 5: Storage Paradigms & Database Mechanics

Theoretical Foundations

Module 5: Storage Paradigms & Database Mechanics

Introduction: The Data Persistence Selection Problem

Section 1: How Databases Store Data

A. B-Tree Indexes (Read-Optimized)

Composite Indexes and Index-Only Scans

B. LSM-Trees (Write-Optimized)

Section 2: SQL Internals — ACID & Transaction Isolation

A. ACID Properties

B. Multi-Version Concurrency Control (MVCC)

C. Transaction Isolation Levels

Section 3: Query Plan Analysis — EXPLAIN ANALYZE

Section 4: NoSQL Database Categories

A. Document Databases (MongoDB, CouchDB)

B. Key-Value Stores (Redis, DynamoDB)

C. Wide-Column Stores (Apache Cassandra, HBase)

D. Graph Databases (Neo4j, Amazon Neptune)

Section 5: Replication Topologies

A. Single-Leader Replication

B. Multi-Leader Replication

C. Leaderless Replication (Dynamo-Style)

Section 6: Database Sharding

A. Range-Based Sharding

B. Hash-Based Sharding

C. Consistent Hashing (Elastic Sharding)

Section 7: Connection Pooling with pgBouncer

Section 8: Hands-On ERD Challenge

The Challenge

Solution Model

Module Deliverables

Module 5: Storage Paradigms & Database Mechanics

Theoretical Foundations

Module 5: Storage Paradigms & Database Mechanics

Introduction: The Data Persistence Selection Problem

Section 1: How Databases Store Data

A. B-Tree Indexes (Read-Optimized)

Composite Indexes and Index-Only Scans

B. LSM-Trees (Write-Optimized)

Section 2: SQL Internals — ACID & Transaction Isolation

A. ACID Properties

B. Multi-Version Concurrency Control (MVCC)

C. Transaction Isolation Levels

Section 3: Query Plan Analysis — EXPLAIN ANALYZE

Section 4: NoSQL Database Categories

A. Document Databases (MongoDB, CouchDB)

B. Key-Value Stores (Redis, DynamoDB)

C. Wide-Column Stores (Apache Cassandra, HBase)

D. Graph Databases (Neo4j, Amazon Neptune)

Section 5: Replication Topologies

A. Single-Leader Replication

B. Multi-Leader Replication

C. Leaderless Replication (Dynamo-Style)

Section 6: Database Sharding

A. Range-Based Sharding

B. Hash-Based Sharding

C. Consistent Hashing (Elastic Sharding)

Section 7: Connection Pooling with pgBouncer

Section 8: Hands-On ERD Challenge

The Challenge

Solution Model

Module Deliverables