Blog/Database Indexing Internals Every Architect Must Know

databaseinternalsindexingperformance

Database Indexing Internals Every Architect Must Know

January 30, 2024·15 min read·by Bishwambhar Sen

A comparison chart outlining B-Tree node branching structures and LSM-Tree write paths (Memtable to SSTable tiering).

Concept

At its core, a database index is a data structure designed to minimize the number of disk pages that must be read from non-volatile storage to satisfy a query. Without an index, a database engine must perform a sequential scan, reading every page of a table from disk into memory to evaluate filter conditions.

For database systems at scale, disk I/O is the primary bottleneck. To design high-performance systems, we must understand the data structures that underpin indexing: B-Trees and Log-Structured Merge (LSM) Trees.

1. B-Tree Indexes: The Read-Optimized Standard

B-Trees (specifically B+ Trees in modern relational databases like PostgreSQL, MySQL, and SQL Server) are self-balancing search trees designed to work efficiently on block storage.

Unlike binary search trees, B-Trees are highly branched. A single node (page) can have hundreds of children. This high fan-out minimizes the depth of the tree. A B-Tree indexing millions of rows typically has a depth of only 3 to 4 levels.

                  [ Root Node: 50 | 100 ]
                 /           |           \
  [ Leaf: 1..49 ]     [ Leaf: 51..99 ]    [ Leaf: 101..* ]

Key features of B+ Trees include:

Internal Nodes: Contain only search keys and child pointers. They act as routers.
Leaf Nodes: Contain the actual keys and either the corresponding data rows or pointers to them (depending on whether the index is clustered). Crucially, leaf nodes are linked sequentially in a doubly linked list, enabling highly efficient range scans.
Fixed Page Size: Database engines read and write to disk in fixed units called pages (typically 8KB in PostgreSQL, 16KB in MySQL InnoDB). Every node in a B-Tree fits into a page.

2. LSM-Tree Indexes: The Write-Optimized Alternative

While B-Trees excel at reads, they perform poorly under write-heavy workloads because they require random in-place updates. A single row update in a B-Tree requires locating the leaf node page, updating the value in memory, and eventually flushing that entire page back to disk.

LSM-Trees (used in Cassandra, RocksDB, and ScyllaDB) solve this by trading read performance for write throughput. Instead of updating data in-place on disk, LSM-Trees append all writes to a sequential log.

Write ──> [ Write-Ahead Log (WAL) ] (Disk append)
      └──> [ Memtable ] (RAM, Sorted)
             │ (Flush when full)
             ▼
      [ SSTables Level 0 ] ──(Compaction)──> [ SSTables Level 1 ]

An LSM-tree write path operates as follows:

Write-Ahead Log (WAL): The write is appended to a log on disk for durability.
Memtable: The write is added to an in-memory, sorted structure (usually a Red-Black tree or Skip List). The write is now complete from the client's perspective.
SSTable Flush: When the Memtable reaches a size limit (e.g., 64MB), it is written to disk as a sorted string table (SSTable). SSTables are immutable.
Compaction: Because SSTables are immutable, duplicate updates to the same key accumulate across different SSTable files. A background process called compaction reads multiple SSTables, merges them, retains only the latest version of each key, and writes a new, consolidated SSTable, discarding the old ones.

Constraints

Architects must navigate physical hardware limitations and algorithmic constraints when designing database schemas.

1. The RUM Conjecture

The RUM Conjecture states that a database engine can optimize for at most two of the following three properties: Read Overhead, Update Overhead, or Memory/Space Overhead.

B-Trees: Optimize for Reads (low read amplification) and Space (minimal duplication), but suffer on Updates (high random write overhead).
LSM-Trees: Optimize for Updates (sequential writes) and Space (when compacted), but suffer on Reads (high read amplification, as a read may have to inspect multiple SSTables).

2. Page Splits and Index Bloat

B-Tree pages are structured to allow room for future inserts. When a page becomes full (100% capacity) and a new key must be inserted, the database engine performs a page split. It allocates a new page, moves half of the rows from the old page to the new page, and inserts the new key.

The Constraint: If keys are inserted in random order (such as UUID v4), page splits occur frequently, leading to fragmented pages that are only 50% full. This causes index bloat, wasting memory in the buffer pool and increasing disk reads.
Architectural Mitigation: Use sequential keys (like Auto-Incrementing IDs, UUID v7, or Snowflake IDs) to ensure inserts always target the rightmost leaf page, avoiding page splits and maintaining high page fill factors.

3. Write Amplification Factor (WAF)

Write amplification occurs when the amount of data written to physical storage is a multiple of the logical data written by the application. $$\text{WAF} = \frac{\text{Bytes Written to Storage}}{\text{Bytes Written by Application}}$$ In a B-Tree, updating a single 100-byte row requires writing the entire 8KB page to disk, resulting in a WAF of 80. In an LSM-Tree, while writes to the Memtable are fast, background compaction repeatedly reads and rewrites SSTables. Under heavy write loads, LSM compaction can saturate disk bandwidth, causing latency spikes.

Trade-offs

1. Clustered vs. Non-Clustered Indexes

Clustered Index: The leaf nodes of the B-Tree contain the actual data rows of the table. A table can have only one clustered index (usually the Primary Key).
- Trade-off: Fast primary key lookups and range scans. However, secondary indexes become slower because their leaf nodes store the clustered index key. A secondary index lookup requires traversing the secondary B-Tree, retrieving the primary key, and then traversing the clustered B-Tree (a "key lookup" or "bookmark lookup").
Non-Clustered Index: The leaf nodes contain pointers to the physical location of the rows on disk (row IDs).
- Trade-off: Faster writes, but reads are slower if they require fetching non-indexed columns, as this involves random page lookups on disk.

2. Covering Indexes vs. Write Overhead

A covering index is a secondary index that includes all columns referenced by a query (either as index keys or using an INCLUDE clause). This allows the database engine to satisfy the query entirely from the index page, bypassing the key lookup step.

Trade-off: Improves read performance from O(log N) + O(disk lookup) to O(log N). However, every additional column in the index increases write latency and storage footprint. If a column is frequently updated, the database must update both the table and the index, increasing lock contention.

Code

To understand indexing data structures, let's explore two implementations: a B-Tree node search algorithm that routes lookups using page pointers, and an LSM-Tree Memtable flush emulator demonstrating sequential SSTable creation.

1. B-Tree Page Routing Simulation

This C# class demonstrates how a database engine traverses internal B-Tree pages to find a specific key at the leaf level.

using System;
using System.Collections.Generic;

namespace Mpc.Database.Internals
{
    public class BTreeRouter
    {
        private const int PageSizeKeys = 3; // Fan-out simulation limit

        public class PageNode
        {
            public bool IsLeaf { get; set; }
            public List<int> Keys { get; set; } = new();
            public List<PageNode> ChildPointers { get; set; } = new(); // Null for leaf nodes
            public PageNode NextPage { get; set; } // Link for range scans on leaf level
        }

        public PageNode Root { get; set; }

        public BTreeRouter()
        {
            // Seed a simple 2-level B-Tree
            var leaf1 = new PageNode { IsLeaf = true, Keys = new List<int> { 10, 20 } };
            var leaf2 = new PageNode { IsLeaf = true, Keys = new List<int> { 30, 40 } };
            var leaf3 = new PageNode { IsLeaf = true, Keys = new List<int> { 50, 60 } };
            leaf1.NextPage = leaf2;
            leaf2.NextPage = leaf3;

            Root = new PageNode
            {
                IsLeaf = false,
                Keys = new List<int> { 25, 45 },
                ChildPointers = new List<PageNode> { leaf1, leaf2, leaf3 }
            };
        }

        public PageNode SearchLeafPage(int key, out int diskReadOperations)
        {
            diskReadOperations = 1; // Start with root page read
            var current = Root;

            while (!current.IsLeaf)
            {
                int i = 0;
                while (i < current.Keys.Count && key >= current.Keys[i])
                {
                    i++;
                }

                current = current.ChildPointers[i];
                diskReadOperations++; // Reading next page node from disk
            }

            return current;
        }
    }

    public class BTreeRunner
    {
        public static void Run()
        {
            var btree = new BTreeRouter();
            var targetLeaf = btree.SearchLeafPage(35, out int diskReads);
            Console.WriteLine($"Target key found in leaf page containing keys: {string.Join(", ", targetLeaf.Keys)}");
            Console.WriteLine($"Total disk page reads: {diskReads}"); // Output: 2
        }
    }
}

2. LSM-Tree Memtable and Sorted String Table (SSTable) Flush Simulation

This code emulates the write path of an LSM-Tree, sorting records in memory and flushing them sequentially to disk as immutable, sorted string tables.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

namespace Mpc.Database.Internals.Lsm
{
    public class LSMEngine
    {
        private readonly string _dataDir;
        private readonly int _memtableThresholdBytes;
        private ConcurrentDictionary<string, string> _memtable;
        private int _currentSSTableIndex = 0;

        public LSMEngine(string dataDir, int memtableThresholdBytes = 1024)
        {
            _dataDir = dataDir;
            _memtableThresholdBytes = memtableThresholdBytes;
            _memtable = new ConcurrentDictionary<string, string>();
            Directory.CreateDirectory(_dataDir);
        }

        public void Put(string key, string value)
        {
            // 1. Write ahead log entry would happen here
            _memtable[key] = value;

            // Simple size calculation approximation
            int estimatedSize = _memtable.Sum(x => x.Key.Length + x.Value.Length);
            if (estimatedSize >= _memtableThresholdBytes)
            {
                FlushMemtableToSSTable();
            }
        }

        private synchronized void FlushMemtableToSSTable()
        {
            if (_memtable.IsEmpty) return;

            // Capture snapshot of current memtable
            var snapshot = _memtable.OrderBy(x => x.Key).ToList();
            _memtable.Clear();

            string sstablePath = Path.Combine(_dataDir, $"sstable_{_currentSSTableIndex++}.db");

            using (var stream = new FileStream(sstablePath, FileMode.Create, FileAccess.Write))
            using (var writer = new StreamWriter(stream))
            {
                foreach (var kvp in snapshot)
                {
                    // Write data sequentially in sorted order
                    writer.WriteLine($"{kvp.Key}:{kvp.Value}");
                }
            }

            Console.WriteLine($"Memtable flushed successfully. Immutable SSTable created: {sstablePath}");
        }
    }
}