B-tree

A B-tree is a self-balancing search tree where each node holds many keys and has many children, not just two. It generalises the Binary search tree from “two children per node” to “up to $m$ children per node,” which keeps the tree shallow even for huge data sets. B-trees and their variants are the structure behind most database indexes and many on-disk filesystem directories.

Image: Example B-tree, CC BY-SA 3.0

The motivation is simple: a Red-black tree with $10^{9}$ keys is around 30 levels deep on average (worst case $2 \log_{2}(n + 1) \approx 60$, since red-black trees can be up to twice as tall as a perfectly balanced tree). That’s tens of random memory reads to find a key. On disk, where each random read takes on the order of a millisecond, that’s roughly 30 ms typical, up to 60 ms worst case. A B-tree of order 100 with the same number of keys is only $\log_{100} 10^{9} = 4.5$ levels deep, so about five reads. The point of B-trees is to minimise the number of node fetches, which matters whenever fetching a node is expensive (disk, network, cache miss).
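The depth figures above can be checked with a few lines of arithmetic. This is a rough sketch that assumes an idealised tree in which every node is completely full; `levels_needed` is a helper name made up for this note.

```python
import math

def levels_needed(n_keys: int, fanout: int) -> int:
    """Smallest h with fanout**h >= n_keys (idealised: every node is full)."""
    levels, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels

n = 10**9
print(levels_needed(n, 2))     # 30: perfectly balanced binary tree
print(levels_needed(n, 100))   # 5: B-tree of order 100
print(2 * math.log2(n + 1))    # just under 60: red-black worst-case height bound
```

Real B-trees are rarely completely full, so actual depths can be a level or so higher than this estimate.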

Structure

A B-tree of order $m$ (the maximum number of children) satisfies:

1. Every node holds between $\lceil m/2 \rceil - 1$ and $m - 1$ keys (the root may have as few as 1).
2. An internal node with $k$ keys has exactly $k + 1$ children; a leaf has none.
3. The keys in a node are sorted: $k_{1} < k_{2} < \dots < k_{k}$.
4. The subtree between $k_{i}$ and $k_{i + 1}$ contains only keys in that interval. (Same ordering invariant as a BST, generalised.)
5. All leaves are at the same depth.

Property 5 (every leaf at the same depth) is what keeps the tree balanced. Insertions and deletions preserve it by splitting and merging nodes and by shifting single keys between siblings, never by BST-style subtree rotations.
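The five properties translate almost directly into a checker. The sketch below is illustrative only: the `Node` class and the `check` function are assumptions made up for this note, not part of any library.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list = field(default_factory=list)
    children: list = field(default_factory=list)  # empty for a leaf

def check(node, m, lo=None, hi=None, is_root=True):
    """Return the leaf depth below `node`; raise AssertionError on a violation."""
    min_keys = 1 if is_root else math.ceil(m / 2) - 1
    assert min_keys <= len(node.keys) <= m - 1, "key count out of range"
    assert all(a < b for a, b in zip(node.keys, node.keys[1:])), "keys not sorted"
    assert all((lo is None or lo < k) and (hi is None or k < hi)
               for k in node.keys), "key outside its interval"
    if not node.children:
        return 0
    assert len(node.children) == len(node.keys) + 1, "wrong child count"
    bounds = [lo, *node.keys, hi]  # child i lives between bounds[i] and bounds[i + 1]
    depths = {check(c, m, bounds[i], bounds[i + 1], is_root=False)
              for i, c in enumerate(node.children)}
    assert len(depths) == 1, "leaves at different depths"
    return depths.pop() + 1

root = Node([10, 20], [Node([1, 5]), Node([12, 15]), Node([25, 30])])
print(check(root, m=4))  # 1: all leaves sit one level below the root
```

Running a checker like this after every insert and delete in a test suite is a cheap way to catch balancing bugs.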

Search

To find a key, walk down from the root. At each node, search the node’s keys (binary or linear search) to either find the key or pick the child to descend into. A lookup touches $O(\log_{m} n)$ nodes. With binary search inside each node that is $O(\log_{2} m)$ comparisons per node, so $O(\log n)$ comparisons in total. The comparisons are cheap; the fetches are the expensive part.
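A minimal sketch of the lookup, assuming an in-memory `Node` with `keys` and `children` lists (names made up for this note). It also counts node visits, since those stand in for the expensive fetches.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []  # empty for a leaf

def search(node, key):
    """Return (found, nodes_visited)."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)           # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        # Not in this node: descend into child i, or stop at a leaf.
        node = node.children[i] if node.children else None
    return False, visited

root = Node([10, 20], [Node([1, 5]), Node([12, 15]), Node([25, 30])])
print(search(root, 15))  # (True, 2)
print(search(root, 13))  # (False, 2)
```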

Insertion

Always insert into a leaf:

Search down to the leaf where the new key belongs.
If the leaf has room (fewer than $m - 1$ keys), insert in sorted order. Done.
Otherwise the leaf overflows. Split it: take the median key, push it up into the parent, and split the remaining keys into two new leaves.
If pushing up makes the parent overflow, split the parent too. Recurse.
If the root overflows, create a new root with one key — the only operation that increases the tree’s height.

Splitting plays the role that rotation plays in a balanced BST: it’s the local fix that restores the invariant.
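The insertion steps can be sketched recursively. This is one possible formulation, not the only one: it inserts first and checks for overflow afterwards, and it silently ignores duplicate keys (an assumption; the notes don't say how duplicates are handled). `Node`, `insert` and `_insert` are names invented here.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty for a leaf

def _insert(node, key, m):
    """Insert below `node`; return (median, right_half) if `node` split, else None."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return None                       # assumption: duplicates are ignored
    if not node.children:
        node.keys.insert(i, key)          # leaf: insert in sorted order
    else:
        split = _insert(node.children[i], key, m)
        if split is None:
            return None
        median, right = split             # child split: absorb its median key
        node.keys.insert(i, median)
        node.children.insert(i + 1, right)
    if len(node.keys) <= m - 1:
        return None                       # still within capacity
    mid = len(node.keys) // 2             # overflow: split around the median
    median = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return median, right

def insert(root, key, m):
    """Return the (possibly new) root after inserting `key`."""
    split = _insert(root, key, m)
    if split is None:
        return root
    median, right = split
    return Node([median], [root, right])  # the only place the height grows

root = Node()
for k in (1, 2, 3):
    root = insert(root, k, m=3)
print(root.keys)  # [2]: the third insert overflowed the root leaf and split it
```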

Deletion

More fiddly than insertion. The cases:

If the key is in a leaf and the leaf has more than the minimum number of keys, just remove it.
If the key is in an internal node, replace it with its in-order successor (the smallest key in the subtree to its right), then delete the successor from its leaf.
If a leaf would underflow (drop below $\lceil m/2 \rceil - 1$ keys):
- Borrow from a sibling that has more than the minimum (rotate one key through the parent).
- If no sibling can spare a key, merge with a sibling and pull the separator key down from the parent.
Merging may underflow the parent — recurse upward. If the root ends up empty, the tree shrinks by one level.
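The cases above fit in a compact recursive sketch. It deletes first and repairs underflow on the way back up, which is one of several common formulations; `Node`, `delete`, `_delete` and `_fix_child` are names made up for this note, and the trees in the usage example are built by hand.

```python
import math
from bisect import bisect_left

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty for a leaf

def _fix_child(parent, i, min_keys):
    """Repair parent.children[i] after it dropped below min_keys."""
    child = parent.children[i]
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None
    if left is not None and len(left.keys) > min_keys:
        child.keys.insert(0, parent.keys[i - 1])      # borrow through the parent
        parent.keys[i - 1] = left.keys.pop()
        if left.children:
            child.children.insert(0, left.children.pop())
    elif right is not None and len(right.keys) > min_keys:
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if right.children:
            child.children.append(right.children.pop(0))
    else:
        if left is None:                              # leftmost child: merge with right
            i, left, child = i + 1, child, right
        left.keys += [parent.keys.pop(i - 1)] + child.keys  # pull separator down
        left.children += child.children
        parent.children.pop(i)

def _delete(node, key, min_keys):
    i = bisect_left(node.keys, key)
    found = i < len(node.keys) and node.keys[i] == key
    if not node.children:
        if found:
            node.keys.pop(i)
        return
    if found:
        succ = node.children[i + 1]
        while succ.children:
            succ = succ.children[0]
        key = succ.keys[0]            # in-order successor
        node.keys[i] = key            # overwrite, then delete it from the right subtree
        i += 1
    _delete(node.children[i], key, min_keys)
    if len(node.children[i].keys) < min_keys:
        _fix_child(node, i, min_keys)

def delete(root, key, m):
    """Return the (possibly new) root after deleting `key`."""
    _delete(root, key, math.ceil(m / 2) - 1)
    if not root.keys and root.children:
        return root.children[0]       # root emptied by a merge: height shrinks
    return root

root = Node([4], [Node([2], [Node([1]), Node([3])]),
                  Node([6], [Node([5]), Node([7])])])
root = delete(root, 4, m=3)
print(root.keys)  # [2, 5]: two merges cascaded and the tree lost a level
```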

B+ tree variant

Most databases use the B+ tree, a variant where:

Internal nodes hold only keys (no data, just routing information).
All actual data lives in the leaves.
Leaves are linked into a sorted doubly-linked list.

The leaf list makes range queries fast: find the lower bound, then walk the list. SQL WHERE x BETWEEN a AND b is exactly this. Filesystem directory listings work the same way.
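A small sketch of that range scan. It assumes the common convention that a routing key equals the smallest key of the subtree to its right, and it uses only the forward link of the leaf list; `Leaf`, `Internal` and `range_query` are illustrative names.

```python
from bisect import bisect_left, bisect_right

class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
        self.next = None  # forward link; a doubly-linked list also keeps a back link

class Internal:
    def __init__(self, keys, children):
        self.keys, self.children = keys, children  # routing keys only, no data

def range_query(root, a, b):
    """Yield (key, value) pairs with a <= key <= b, in key order."""
    node = root
    while isinstance(node, Internal):
        # Assumed convention: keys equal to a routing key live in the right child.
        node = node.children[bisect_right(node.keys, a)]
    i = bisect_left(node.keys, a)      # lower bound inside the first leaf
    while node is not None:
        while i < len(node.keys):
            if node.keys[i] > b:
                return
            yield node.keys[i], node.values[i]
            i += 1
        node, i = node.next, 0         # walk the leaf list

l1, l2, l3 = Leaf([1, 3], "ab"), Leaf([5, 7], "cd"), Leaf([9, 11], "ef")
l1.next, l2.next = l2, l3
root = Internal([5, 9], [l1, l2, l3])
print(list(range_query(root, 3, 9)))  # [(3, 'b'), (5, 'c'), (7, 'd'), (9, 'e')]
```

After the initial descent, the scan never goes back to an internal node: its cost is one root-to-leaf walk plus one leaf fetch per page of results.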

Why B-trees dominate disk storage

The math: a disk read is roughly $10^{5}$ times slower than a memory access. So you want every disk read to do as much work as possible. A B-tree node is sized to fit a disk page (typically 4 KB or 8 KB), packing dozens or hundreds of keys per fetch. Fewer fetches per lookup means much faster queries.
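To see where “dozens or hundreds of keys” comes from, here is a back-of-the-envelope sketch. The 8-byte key and 8-byte child pointer sizes are assumptions picked for illustration, and real page formats also spend space on headers and payloads; `fanout` and `max_keys` are made-up helper names.

```python
def fanout(page_bytes, key_bytes, pointer_bytes, header_bytes=0):
    """Largest m such that m pointers and m - 1 keys fit in one page."""
    usable = page_bytes - header_bytes
    return (usable + key_bytes) // (key_bytes + pointer_bytes)

def max_keys(order, levels):
    """Keys in a completely full B-tree of the given order and number of levels."""
    return order ** levels - 1

m4k = fanout(4096, key_bytes=8, pointer_bytes=8)   # assumed sizes
m8k = fanout(8192, key_bytes=8, pointer_bytes=8)
print(m4k, m8k)            # 256 512
print(max_keys(m4k, 4))    # 4294967295: about four billion keys in four levels
```

Under these assumptions a four-level tree of 4 KB pages already covers billions of keys, which is why index lookups usually cost only a handful of page reads.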

Concrete examples:

PostgreSQL, MySQL InnoDB, SQL Server: the default indexes are B+ trees or close variants.
NTFS, HFS+: directory and catalog indexes are B-trees. Linux ext4 uses HTree, a hash-keyed B-tree variant.
SQLite, BerkeleyDB: full database stored as a B-tree.

For an in-memory balanced search tree, see Red-black tree or AVL tree — those are the right choice when every “fetch” is just a pointer dereference. B-trees are specifically for the case where fetches are expensive.

Idriss Rami — Notes
