Theoretical Foundations
Welcome to the curriculum workspace. Here you will find long-form technical guidelines outlining core architectural blueprints and implementation mechanics.
Module 5: Storage Paradigms & Database Mechanics
PHASE 2 — SYSTEM DESIGN FOUNDATIONS: Your application logic is only as fast as the data store beneath it. This module gives you the mechanical knowledge to choose the right database paradigm, design efficient indexes, interpret query plans, and architect replication and sharding topologies for scale. These fundamentals underpin Modules 6 (Caching), 11 (Sagas), and 12 (CQRS).
Introduction: The Data Persistence Selection Problem
There are hundreds of database products on the market. Every engineering team faces the same decision: which storage engine is right for this workload? The wrong choice can result in:
- Write bottlenecks (choosing a read-optimized B-Tree index for a write-heavy log ingestion pipeline).
- Query impossibility (choosing a key-value store for a workload that requires full-text search and aggregation).
- Operational complexity (choosing a distributed wide-column store when a single-node PostgreSQL would have served the workload for 5 years without changes).
The correct framework for database selection starts with understanding the mechanical internals of how each storage engine works.
Section 1: How Databases Store Data
A. B-Tree Indexes (Read-Optimized)
B-Trees (Balanced Trees) are the foundational data structure behind most relational databases (PostgreSQL, MySQL, SQLite). A B-Tree is a self-balancing tree where each node can have multiple children:
[Root Node]
[30 | 70]
/ | \
[10|20] [40|60] [80|90]
/ | \ / | \ / | \
[5] [15] [25][35][50][65][75][85][95]
(Leaf nodes contain actual row pointers to disk)
B-Tree Mechanics:
- Each node is sized to match the OS page size (typically 8KB or 16KB).
- Finding a record requires traversing $O(\log_B N)$ nodes, where $B$ is the branching factor and $N$ is the total number of records.
- For 1 billion records with $B=100$: depth = $\log_{100}(10^9) \approx 4.5$ — only 5 disk reads to find any record.
Write Path: Inserting a new value requires finding the correct leaf node and potentially rebalancing the tree (split operations). B-Tree writes are in-place updates — they overwrite existing disk blocks.
Composite Indexes and Index-Only Scans
A composite index on (user_id, created_at) stores keys sorted first by user_id, then by created_at:
-- This query can use the composite index efficiently
SELECT id, total FROM orders
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 20;
-- Execution plan: Index Scan using orders_user_created_idx
-- Index Cond: (user_id = 12345)
-- Index pages read: 3 (instead of full 50,000-row table scan)
An Index-Only Scan occurs when the query only needs columns that exist in the index — no heap access required:
-- Composite index on (user_id, status) covers this query entirely
SELECT COUNT(*) FROM orders WHERE user_id = 12345 AND status = 'completed';
-- Index Only Scan: heap fetches = 0
B. LSM-Trees (Write-Optimized)
Log-Structured Merge-Trees (LSM-Trees) are used by write-heavy databases: Apache Cassandra, RocksDB, LevelDB, ClickHouse.
Write Path:
Write Request
|
v
[MemTable] (in-memory sorted buffer)
|
| (when MemTable reaches ~64MB threshold)
v
[SSTable 0] (immutable sorted file on disk)
|
| (background compaction: merge SSTables)
v
[SSTable 1] + [SSTable 2] → [SSTable merged]
- All writes go to an in-memory MemTable (a balanced tree like a Red-Black Tree).
- When the MemTable exceeds a threshold, it is flushed to disk as an immutable SSTable (Sorted String Table).
- Background compaction merges multiple SSTables to reclaim space and remove tombstoned deletes.
Read Path: Must check the MemTable + all SSTables (most recent wins). This is mitigated by Bloom Filters — probabilistic data structures that can definitively say "this key does NOT exist in this SSTable," reducing unnecessary disk reads.
| Feature | B-Tree | LSM-Tree |
|---|---|---|
| Write Performance | Moderate (in-place updates, seek overhead) | Excellent (sequential appends) |
| Read Performance | Excellent (direct lookup via tree traversal) | Good (Bloom filter + compaction) |
| Space Amplification | Low | High (multiple versions until compaction) |
| Write Amplification | Moderate | High (data written multiple times during compaction) |
| Best For | OLTP reads, range queries | High-throughput writes, time-series, logs |
| Examples | PostgreSQL, MySQL, SQLite | Cassandra, RocksDB, LevelDB, ClickHouse |
Section 2: SQL Internals — ACID & Transaction Isolation
A. ACID Properties
Every relational database guarantees ACID properties for transactions:
| Property | Definition | Implementation Mechanism |
|---|---|---|
| Atomicity | All operations in a transaction succeed or all are rolled back | Write-Ahead Log (WAL) |
| Consistency | The database transitions from one valid state to another | Constraints, Foreign Keys, Triggers |
| Isolation | Concurrent transactions do not interfere with each other | MVCC (Multi-Version Concurrency Control) |
| Durability | Committed data survives crashes | WAL flushed to durable storage before commit confirmation |
B. Multi-Version Concurrency Control (MVCC)
PostgreSQL uses MVCC to allow readers and writers to operate concurrently without blocking each other. Instead of locking rows, MVCC stores multiple versions of each row:
Row: user_id=42, name="Alice", xmin=100, xmax=null (current version)
(written by transaction 100)
Transaction 101: UPDATE users SET name="Alicia" WHERE id=42
→ Creates NEW version: name="Alicia", xmin=101, xmax=null
→ Marks OLD version: xmin=100, xmax=101 (now expired)
Concurrent Transaction 99 (started before 101):
→ Reads version where xmin=100 (the old "Alice" row)
→ Does NOT see the "Alicia" update (snapshot isolation)
C. Transaction Isolation Levels
PostgreSQL supports four isolation levels that trade consistency for concurrency:
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use Case |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Analytics (rare) |
| Read Committed | Not possible | Possible | Possible | Default (most OLTP) |
| Repeatable Read | Not possible | Not possible | Possible | Financial reports |
| Serializable | Not possible | Not possible | Not possible | Critical financial transactions |
-- Set isolation level for a transaction
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT balance FROM accounts WHERE id = 42;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;
Section 3: Query Plan Analysis — EXPLAIN ANALYZE
Before adding indexes or replicas, diagnose slow queries with EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
AND o.created_at > NOW() - INTERVAL '24 hours'
ORDER BY o.created_at DESC
LIMIT 100;
Output interpretation:
Sort (cost=1234.56..1235.06 rows=100) (actual time=45.123..45.145 rows=100)
Sort Key: o.created_at DESC
-> Hash Join (cost=450.00..1230.00 rows=200) (actual time=12.345..44.678 rows=100)
Hash Cond: (o.customer_id = c.id)
-> Index Scan on orders_status_created_idx (cost=0.43..780.56 rows=200)
Index Cond: ((status = 'pending') AND (created_at > ...))
-> Hash (cost=200.00..200.00 rows=20000) (actual time=8.456..8.456 rows=20000)
-> Seq Scan on customers (cost=0.00..200.00 rows=20000)
Reading the plan:
Seq Scan on customers— full table scan on 20,000 customer rows. If customers grows to 2 million rows, this becomes the bottleneck.Index Scan on orders_status_created_idx— efficient composite index usage.actual time=45.123ms— total execution time.
Optimization: Add an index on customers.id (it may already be the PK) and ensure the orders composite index on (status, created_at) exists.
CREATE INDEX CONCURRENTLY orders_status_created_idx
ON orders(status, created_at DESC)
WHERE status IN ('pending', 'processing');
-- Partial index: only index relevant statuses, reducing index size by ~80%
Section 4: NoSQL Database Categories
A. Document Databases (MongoDB, CouchDB)
Store data as JSON/BSON documents. Schema-flexible — each document can have different fields.
Best for: Content management, user profiles, catalogs with variable attributes. Query pattern: Rich queries on nested document fields.
// MongoDB: store and query a user profile with variable fields
db.users.insertOne({
_id: "user-123",
name: "Alice Chen",
role: "architect",
certifications: ["MPC-Professional", "AWS-SA"],
preferences: { theme: "dark", notifications: { email: true, push: false } }
});
db.users.find({
"certifications": "MPC-Professional",
"preferences.notifications.email": true
});
Limitation: No multi-document ACID transactions (available in MongoDB 4.0+ but expensive). Poor for complex relational queries with joins.
B. Key-Value Stores (Redis, DynamoDB)
The simplest database model: a hash map. Every value is retrieved by its exact key.
Best for: Session storage, caching, leaderboards, rate limiting.
Query pattern: GET key, SET key value, DEL key. No query language.
# Redis: session storage
SET session:abc123 '{"userId":"user-42","role":"admin"}' EX 3600
GET session:abc123
TTL session:abc123 # → 3599
# Redis Sorted Set: real-time leaderboard
ZADD leaderboard 94850 "alice"
ZADD leaderboard 87340 "bob"
ZREVRANGE leaderboard 0 9 WITHSCORES # Top 10 players
C. Wide-Column Stores (Apache Cassandra, HBase)
Data is organized in tables with rows and columns, but columns are dynamic per row — different rows can have different columns. Designed for massive write throughput and horizontal partitioning.
Best for: Time-series data, IoT sensor streams, user activity logs at petabyte scale.
Cassandra table: sensor_readings
Partition Key: device_id (determines which node stores this data)
Clustering Key: timestamp (sorts within partition)
device_id | timestamp | temperature | battery | location
----------------------------------------------------------------------
sensor-1 | 2024-01-01 12:00:00 | 22.5 | 87% | 37.7,122.4
sensor-1 | 2024-01-01 12:00:01 | 22.6 | 87% | 37.7,122.4
sensor-2 | 2024-01-01 12:00:00 | 19.1 | 43% | 51.5,-0.1
D. Graph Databases (Neo4j, Amazon Neptune)
Stores data as nodes (entities) and edges (relationships). Optimized for traversing highly connected data.
Best for: Social networks (friends-of-friends), recommendation engines, fraud detection, knowledge graphs.
// Neo4j Cypher: find friends-of-friends who share interests
MATCH (user:User {id: 'alice'})-[:FOLLOWS]->(friend:User)-[:FOLLOWS]->(fof:User)
WHERE NOT (user)-[:FOLLOWS]->(fof)
AND (fof)-[:INTERESTED_IN]->(:Topic {name: 'Systems Architecture'})
RETURN fof.name, COUNT(*) as mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10
SQL would require: Multiple self-joins with extremely high complexity for traversal beyond 2–3 hops.
Section 5: Replication Topologies
A. Single-Leader Replication
All writes go to a designated leader node. The leader replicates to follower nodes asynchronously.
graph TD
W[Write Client] -->|Write| Leader[(Leader DB)]
Leader -->|Async WAL stream| F1[(Follower 1)]
Leader -->|Async WAL stream| F2[(Follower 2)]
R1[Read Client A] -->|Read| F1
R2[Read Client B] -->|Read| F2
Failover: If the leader fails, a follower is promoted via a consensus protocol (Raft). This process can take 30–60 seconds and may involve data loss (writes acknowledged by leader but not yet replicated to followers).
Replication Lag Formula: $$\text{Lag} = T_{\text{apply on follower}} - T_{\text{commit on leader}}$$ For busy leaders with high write volume, this lag can reach seconds or minutes, causing stale reads from followers.
B. Multi-Leader Replication
Multiple nodes accept writes, each acting as both a leader and a follower to other leaders.
Data Center A Data Center B
[Leader A] ←————async————→ [Leader B]
↑ ↑
[Local Clients] [Local Clients]
Write Conflict Problem: Alice updates config.maxUsers = 100 on Leader A. Bob simultaneously updates config.maxUsers = 200 on Leader B. When replicated, both leaders have divergent values. Last-Write-Wins (LWW) based on timestamp requires synchronized clocks — dangerous across data centers. See Module 7 (CAP Theorem) for resolution strategies.
C. Leaderless Replication (Dynamo-Style)
No single node is the master. Clients write to and read from multiple nodes in parallel using quorums.
$$W + R > N$$
Where $W$ = write quorum, $R$ = read quorum, $N$ = total replica count.
Example: $N=3, W=2, R=2$: Write must succeed on 2 of 3 replicas. Read must query 2 of 3 replicas. Guaranteed overlap ensures at least one node has the latest write.
| Configuration | Consistency Guarantee | Availability |
|---|---|---|
| $W=3, R=1$ (N=3) | Strong read consistency | Low — all must ack writes |
| $W=1, R=3$ (N=3) | Strong read consistency | Low — all must be queried |
| $W=2, R=2$ (N=3) | Balanced | High — tolerates 1 node failure |
| $W=1, R=1$ (N=3) | No guarantee (eventual) | Very high |
Section 6: Database Sharding
Sharding (horizontal partitioning) distributes rows across multiple independent database nodes. Each node is called a shard.
A. Range-Based Sharding
Shard 1: user_id 1 – 10,000,000
Shard 2: user_id 10,000,001 – 20,000,000
Shard 3: user_id 20,000,001 – 30,000,000
Problem: If most active users have IDs in the 10M–20M range (recently acquired users), Shard 2 receives the majority of traffic while Shards 1 and 3 sit idle. This is called a hot shard.
B. Hash-Based Sharding
shard_id = hash(user_id) % total_shards
hash("user-12345") % 4 = 1 → Shard 1
hash("user-67890") % 4 = 3 → Shard 3
Problem: Adding a new shard requires re-hashing almost all data. Increasing from 4 to 5 shards means: $$\text{Keys remapped} = \frac{N-1}{N} \approx 80%$$
Only ~20% of keys land on the same shard as before. This requires a massive data migration.
C. Consistent Hashing (Elastic Sharding)
Consistent Hashing maps both data keys and server nodes onto a circular ring (0 to $2^{32}$). A key is assigned to the nearest clockwise node:
Key K5 (hash=280)
|
Node C (300) ←--+ ← K5 assigned to Node C
/
Ring: 0 ——— Node A (100) ——— Key K1 (150) ——— Node B (200) ——— Key K3 (250) ——— Node C (300) ——— 360/0
Adding a new node: Only keys between the new node and its predecessor need to move. With $N$ nodes, adding one moves only $\frac{1}{N+1}$ of total keys — a minimal migration.
Virtual Nodes: Each physical server is assigned multiple positions on the ring (virtual nodes), distributing load more evenly and allowing fine-grained rebalancing when nodes join or leave.
Section 7: Connection Pooling with pgBouncer
Every PostgreSQL connection consumes ~5-10MB of server memory and spawns a dedicated OS process. An application server with 100 concurrent queries needs 100 database connections — at 10MB each, that's 1GB of PostgreSQL memory just for connection overhead.
pgBouncer is a connection pooler that maintains a small pool of long-lived database connections, multiplexing thousands of application-side connections:
[App Server 1: 50 worker threads]
[App Server 2: 50 worker threads] → [pgBouncer] → [PostgreSQL: 20 connections]
[App Server 3: 50 worker threads]
150 application connections → multiplexed to → 20 database connections
pgBouncer configuration:
[databases]
macropatterns_prod = host=db.internal port=5432 dbname=mpc_production
[pgbouncer]
pool_mode = transaction # Connection held only during a transaction
max_client_conn = 1000 # Max app-side connections
default_pool_size = 20 # Max backend DB connections
min_pool_size = 5
reserve_pool_size = 5
In transaction pool mode, a connection is returned to the pool the moment a transaction commits, allowing 20 database connections to serve thousands of concurrent requests.
Section 8: Hands-On ERD Challenge
The Challenge
Design an Entity-Relationship Diagram showing a sharded order history database. The diagram must show:
- A
user_shard_maplookup table mappinguser_idtoshard_id. - A
USERStable with ashard_idcolumn for co-location. - An
ORDERStable with a foreign key toUSERSand its ownshard_id. - Crow's foot notation showing cardinality.
Solution Model
erDiagram
USER_SHARD_MAP {
string user_id PK
int shard_id
string region
}
USERS {
string user_id PK
int shard_id FK
string email
string name
string plan
timestamp created_at
}
ORDERS {
string order_id PK
string user_id FK
int shard_id
decimal total
string status
timestamp created_at
}
ORDER_ITEMS {
string item_id PK
string order_id FK
string product_id
int quantity
decimal unit_price
}
USER_SHARD_MAP ||--o{ USERS : "routes to shard"
USERS ||--o{ ORDERS : "places"
ORDERS ||--|{ ORDER_ITEMS : "contains"
Key architectural decisions:
shard_idis denormalized onto bothUSERSandORDERSto enable shard-local JOINs without cross-shard queries.user_shard_mapis a small, globally replicated lookup table (hot data), allowing any application server to route a request to the correct shard.ORDER_ITEMSare co-located on the same shard as their parent order — no cross-shard joins needed.
Bridge to Module 6: Now that you can store data efficiently, you need to protect your database from being hit on every read request. Module 6 (Caching Topologies) covers the layered caching strategies — from browser caches to distributed Redis clusters — that keep your database at acceptable load levels as traffic grows.