Small Data Structures That Lie Just Enough

A database query you can rule out before touching the database is a small victory.

“Have I seen this key before?” “Is this username probably already taken?” “Should I even bother asking the slow database?”

If you insist on exactness, the answer is usually some kind of full set representation: store every key, check membership exactly, pay the memory bill, move on.

But sometimes that is overkill. Sometimes you want a much cheaper front door, something small and fast that can answer:

definitely not or maybe

That is the world of the Bloom filter.¹

The promise

A Bloom filter is a compact probabilistic data structure for set membership. It can say “definitely not in the set” or “possibly in the set.”

The important asymmetry: false positives are possible, but false negatives are not, provided the filter is insert-only and implemented correctly.

If the filter says an item is absent, you can trust that answer. If it says “present,” you may still need a slower exact check behind it. That is why Bloom filters are useful. They do not replace exact storage. They save you from asking exact questions too often.

How it works

Start with an array of m bits, all zero.

To insert an item, hash it k times, getting k positions in the bit array, and set all of those positions to 1.

To query an item, hash it the same k ways and check those positions. If any one of those bits is 0, the item was definitely never inserted. If all k bits are 1, the item may have been inserted.

The structure is tiny because it never stores the items themselves. It stores only a blurry shadow of them.

False-positive rate

If you insert n items into a Bloom filter with m bits and k hash functions, the classic false-positive approximation is

p_{\mathrm{fp}} \approx \left(1 - e^{-kn/m}\right)^k.

That approximation assumes hash outputs behave independently and uniformly. As n grows, the filter saturates. As m grows, collisions become less common. The sweet spot for k sits near $(m/n)\ln 2$ .

At 10 bits per inserted item, the optimum is about 7 hash probes and the false-positive rate is roughly 0.82%. That is the bargain in concrete terms: about 1.25 bytes per key lets the filter avoid more than 99 out of 100 pointless exact lookups for absent keys, assuming the workload matches the sizing estimate.

A tiny implementation

Here is a deliberately small TypeScript Bloom filter. The seeded FNV-style hashes are correlated and the bit array spends one byte per bit, so this is a visualization aid, not a production implementation.

bloom-filter.ts

class BloomFilter {
  private bits: Uint8Array;
  private m: number;
  private k: number;

  constructor(m: number, k: number) {
    this.bits = new Uint8Array(m);
    this.m = m;
    this.k = k;
  }

  private hash(seed: number, value: string): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.m;
  }

  add(value: string): void {
    for (let i = 0; i < this.k; i++) {
      this.bits[this.hash(i, value)] = 1;
    }
  }

  mightContain(value: string): boolean {
    for (let i = 0; i < this.k; i++) {
      if (this.bits[this.hash(i, value)] === 0) {
        return false;
      }
    }
    return true;
  }
}

Add a bunch of items. Query some inserted keys and some fresh ones. The asymmetry shows up quickly: inserted items come back “maybe” (effectively yes), fresh items mostly come back “definitely not,” and a few fresh items get unlucky and return “maybe.” That is the lie. It is one-sided, and it is useful.

Bloom Filter Explorer

m (bits)

k (hashes)

063

Enter = insert, Shift+Enter = query

items (n)0

bits set0/64 (0%)

theoretical FP0.0%

empirical FP0.0%

Insert words and query them in the widget above. Watch which bits light up on insertion, and how queried positions are highlighted. Increase the number of inserted items and watch the false positive rate climb as the bit array saturates. The trade-off between m, k, and n becomes physical.

Deletion breaks the promise

Deletion breaks the one-sided promise. Clearing a bit for one key can erase evidence shared with another key and create false negatives. Counting Bloom filters solve that with counters instead of bits, at a higher memory cost.

A Bloom filter is not a sample of the set. It is a lossy membership summary with an error budget. Even the cleanest lie needs a maintenance policy.