Skip to main content

Small Data Structures That Lie Just Enough

· 4 min read
Series parts
  1. Part 7Small Data Structures That Lie Just Enough
On this page

A lot of software questions sound more precise than they really are.

“Have I seen this key before?” “Is this username probably already taken?” “Should I even bother asking the slow database?”

If you insist on exactness, the answer is usually some kind of full set representation: store every key, check membership exactly, pay the memory bill, move on.

But sometimes that is overkill. Sometimes you want a much cheaper front door, something small and fast that can answer:

definitely not or maybe

That is the world of the Bloom filter.

The promise

A Bloom filter is a compact probabilistic data structure for set membership. It can say “definitely not in the set” or “possibly in the set.”

The important asymmetry: false positives are possible, but false negatives are not.

If the filter says an item is absent, you can trust that answer. If it says “present,” you may still need a slower exact check behind it. That is why Bloom filters are useful. They do not replace exact storage. They save you from asking exact questions too often.

How it works

Start with an array of m bits, all zero.

To insert an item, hash it k times, getting k positions in the bit array, and set all of those positions to 1.

To query an item, hash it the same k ways and check those positions. If any one of those bits is 0, the item was definitely never inserted. If all k bits are 1, the item may have been inserted.

That is the whole mechanism. The structure is tiny because it never stores the items themselves. It stores only a blurry shadow of them.

False-positive rate

If you insert n items into a Bloom filter with m bits and k hash functions, the classic false-positive approximation is

pfp(1ekn/m)k.p_{\mathrm{fp}} \approx \left(1 - e^{-kn/m}\right)^k.

That is the main trade-off. As n grows, the filter saturates. As m grows, collisions become less common. As k changes, you shift the balance between underusing and overpainting the bit array. There is even a sweet spot for k, roughly near (m/n)ln2(m/n)\ln 2.

The formula makes the engineering trade-off explicit: memory, workload, and error rate move together.

A tiny implementation

Here is a deliberately small TypeScript Bloom filter. The hashes are toy-quality, which is fine for a demo and terrible for production.

class BloomFilter {
private bits: Uint8Array;
private m: number;
private k: number;
constructor(m: number, k: number) {
this.bits = new Uint8Array(m);
this.m = m;
this.k = k;
}
private hash(seed: number, value: string): number {
let h = 2166136261 ^ seed;
for (let i = 0; i < value.length; i++) {
h ^= value.charCodeAt(i);
h = Math.imul(h, 16777619);
}
return (h >>> 0) % this.m;
}
add(value: string): void {
for (let i = 0; i < this.k; i++) {
this.bits[this.hash(i, value)] = 1;
}
}
mightContain(value: string): boolean {
for (let i = 0; i < this.k; i++) {
if (this.bits[this.hash(i, value)] === 0) {
return false;
}
}
return true;
}
}

Add a bunch of items. Query some inserted keys and some fresh ones. The asymmetry shows up quickly: inserted items come back “maybe” (effectively yes), fresh items mostly come back “definitely not,” and a few fresh items get unlucky and return “maybe.” That is the lie. It is one-sided, and it is useful.

Bloom Filter Explorer
m (bits)
k (hashes)
063
items (n)0
bits set0/64 (0%)
theoretical FP0.0%
empirical FP0.0%

Insert words and query them in the widget above. Watch which bits light up on insertion, and how queried positions are highlighted. Increase the number of inserted items and watch the false positive rate climb as the bit array saturates. The trade-off between m, k, and n becomes physical.

Why Bloom filters are useful

Bloom filters embody a simple compromise: use a cheap probabilistic pre-check before a slower exact lookup.

That shows up everywhere: front-ending a slow database lookup, filtering repeated crawl URLs, reducing unnecessary disk probes, screening candidate joins, avoiding work on definitely unseen items.

A Bloom filter is not a statistical sample of the set. It is a probabilistic summary of it. It fits this series for the same reason the earlier posts do: you give up a small amount of certainty to buy a large amount of efficiency.


Up to now, randomness has helped us represent data: fair samples, streams, sets. Next, it starts helping us search and compute. Local optima, random restarts, randomized graph algorithms, cheap stochastic steps. That is the next post.