Linear probing

Linear probing is the simplest collision-resolution scheme for an open-addressed Hash table. When the bucket $h (k)$ is occupied by a different key, probe the next bucket, then the next, and so on:

$index_{i} = (h (k) + i) mod M, i = 0, 1, 2, \dots$

Walk linearly until you find either the target key (lookup hit), an empty bucket (lookup miss / insertion slot), or a tombstone (slot vacated by a delete).

The probe sequence for any key $k$ is therefore $h (k), h (k) + 1, h (k) + 2, \dots$ — fully determined by the initial hash. Two keys with the same initial hash share the entire probe sequence; two keys with adjacent initial hashes overlap on every subsequent step.

Why it’s simple

Linear probing fits the cache extremely well. Probing the next bucket touches the next memory cell, which is almost certainly already in cache after the first probe (cache lines hold 8–16 hash-table slots). So even a 5-probe lookup costs essentially one memory access.

This cache behaviour is why modern hash tables — including Google’s SwissTable, Rust’s HashMap, Robin Hood hashing — use linear probing under the hood despite its theoretical clustering issue.

Primary clustering

The downside is primary clustering: contiguous runs of occupied buckets that grow over time.

Mechanism: any new insertion that hashes to any bucket within an existing run collides and ends up at the run’s tail, extending it by one. Longer runs get more new arrivals (more buckets means more $h (k)$ values land in the run). Runs grow superlinearly, lookups spend more time scanning, and performance degrades sharply as the load factor rises.

This is why the analytical probe count for linear probing is much worse than for “ideal” uniform hashing.

Probe-count formulas

Let $α = n / M$ be the load factor (items / buckets). Knuth’s analysis gives:

Unsuccessful search / insertion (average probes): $\frac{1}{2} (1 + \frac{1}{( 1 - α ) ^{2}})$ .
Successful search (average probes): $\frac{1}{2} (1 + \frac{1}{1 - α})$ .

Compare with uniform hashing:

Unsuccessful: $\frac{1}{1 - α}$ .
Successful: $\frac{1}{α} ln \frac{1}{1 - α}$ .

At $α = 0.5$ :

Linear: 2.5 unsuccessful, 1.5 successful.
Uniform: 2 unsuccessful, 1.39 successful.

At $α = 0.9$ :

Linear: 50.5 unsuccessful, 5.5 successful.
Uniform: 10 unsuccessful, 2.56 successful.

The asymptotic blow-up in the unsuccessful case is dramatic: $1/ (1 - α)^{2}$ vs $1/ (1 - α)$ . Linear probing degrades fast as $α \to 1$ .

Practical load-factor limit

To keep the unsuccessful-probe count under, say, 5, linear probing needs $α < 0.7$ ish. Rust’s standard HashMap resizes at $α = 7/8 = 0.875$ (with Robin Hood adjustments to dampen primary clustering); the SwissTable in Google’s Abseil resizes at $α = 7/8$ similarly.

Older hash table implementations targeting $α \leq 0.5$ — half the table empty — were a common over-correction.

Deletion and tombstones

Naively setting a deleted slot to NULL breaks the probe chain. Subsequent lookups for keys further along the probe sequence would stop at the NULL slot, missing those keys.

The standard fix is a tombstone: a sentinel value marking “this slot was occupied, but the key was deleted.” Lookups skip past tombstones; insertions can reuse them.

Initial:        [A][B][C][_][_][_]   (3 keys, 3 empty)
delete(B):      [A][T][C][_][_][_]   (T = tombstone)
search(C):      check 0 (A, no match) → check 1 (tombstone, skip) → check 2 (C, match!)
insert(D)→1:    [A][D][C][_][_][_]   (tombstone overwritten)

Tombstones accumulate over time and slow lookups, so periodic rehash compaction is needed if insert/delete patterns leave many.

Linear probing variants

Two refinements that keep linear probing’s cache behaviour while reducing clustering:

Robin Hood hashing. When inserting, if the bucket is occupied by a key with a smaller probe distance than your current one, swap with that key and continue probing for the displaced one. The result is that all probe distances stay close to each other — no key suffers a much longer probe than its peers, evening out the worst case.
Hopscotch hashing. Each bucket has a neighbourhood (typically 32 slots wide) within which its key must live. Inserts may shuffle nearby keys to maintain the neighbourhood invariant. Lookup is bounded to a single cache-line scan.

These variants are what production-grade hash tables (Folly, Abseil, Java’s HashMap with treeified buckets) actually use.

In context

Linear probing is one of three open-addressing schemes; the other two — Quadratic probing (steps of $i^{2}$ ) and Double hashing (step depends on a second hash) — try to spread out probes to reduce clustering at the cost of worse cache behaviour.

The alternative paradigm is closed-address Separate chaining, where each bucket holds a linked list. Closed-address is simpler to implement but has worse cache behaviour than linear probing.

For the parent comparison and the analyses of all four schemes side by side, see Hash table.

Idriss Rami — Notes

Explorer