Skip to main content

How to Keep a Fair Sample of a Stream

· 5 min read
Series parts
  1. Part 4How to Keep a Fair Sample of a Stream
On this page

A list is polite.

A list sits there and lets you count it. You can ask how long it is. You can shuffle it. You can index into it.

A stream is not like that.

A stream keeps arriving after you have run out of memory. Logs keep coming in. Events keep firing. Requests keep landing. You do not get to pause the world, count everything, and then decide what your sample should have been.

And yet the question remains perfectly reasonable:

Can I keep a fair sample of everything I have seen so far, without storing everything?

Yes. The algorithm is called reservoir sampling.

The problem in one sentence

Items arrive one by one: x1,x2,x3,x_1, x_2, x_3, \dots. You do not know how many there will be. You want to keep exactly k of them in memory, so that after seeing n items, every item seen so far has probability

P(item i is in the reservoir)=kn.P(\text{item } i \text{ is in the reservoir}) = \frac{k}{n}.

If you have seen 10,000 log lines and your reservoir size is 100, each line should have a 1% chance of being in memory right now. Not just the recent ones. Not just the lucky early ones. All of them.

The algorithm

The simplest version is often called Algorithm R.

  1. Put the first k items directly into the reservoir.
  2. When the (t+1)(t+1)-st item arrives, keep it with probability k/(t+1)k / (t+1).
  3. If you keep it, evict one of the current k reservoir items uniformly at random.

That is the whole algorithm.

export function reservoirSample<T>(
stream: Iterable<T>,
k: number
): T[] {
const reservoir: T[] = [];
let seen = 0;
for (const item of stream) {
seen += 1;
if (reservoir.length < k) {
reservoir.push(item);
continue;
}
const j = Math.floor(Math.random() * seen);
if (j < k) {
reservoir[j] = item;
}
}
return reservoir;
}

The key line is:

const j = Math.floor(Math.random() * seen);

If j < k, the new item enters the reservoir. Otherwise it is ignored. That single line silently implements the “keep with probability kk / seen” step.

Why it works

The proof is a short induction.

Take the moment when you have seen tt items and your reservoir is already fair. Now item t+1t+1 arrives. It should be included with probability k/(t+1)k/(t+1), and that is exactly what the algorithm does.

What about an older item already in the reservoir? It was there with probability k/tk/t. Once the new item arrives, it survives unless two things happen: the new item is accepted (probability k/(t+1)k/(t+1)), and this specific old item is chosen for eviction (probability 1/k1/k).

The probability of getting kicked out is

kt+11k=1t+1.\frac{k}{t+1}\cdot\frac{1}{k}=\frac{1}{t+1}.

The survival probability is t/(t+1)t/(t+1), so the old item remains with probability

kttt+1=kt+1.\frac{k}{t}\cdot\frac{t}{t+1}=\frac{k}{t+1}.

The old items and the new item all end up with the same inclusion probability.

Reservoir Sampling
Reservoir (k = 5)
slot 0
slot 1
slot 2
slot 3
slot 4
Click "Step" to begin streaming items into the reservoir.
k
Seen: 0 / 200 items

Step through items one at a time above, or let it auto-play. Watch the reservoir fill up greedily at first, then become increasingly selective. Switch to the fairness view and run thousands of simulations; the inclusion histogram flattens, proving every position has equal probability.

Why this matters

Reservoir sampling is the first algorithm in this series that feels unmistakably like software rather than textbook probability. It solves a constraint you actually run into: you cannot store the whole stream, you do not know how long it will be, and you still want a sample you can defend.

That comes up in telemetry, observability, online analytics, rate-limited inspection, and any long-running process where “just keep everything” is not really a plan.

The algorithm starts greedy, then becomes increasingly selective. Early items enter easily because the reservoir is still filling. Later items enter only with small probability, and even then they rarely displace older items.

A reservoir is fair, not balanced

Reservoir sampling is not trying to make the buffer look aesthetically balanced, upweight rare classes, preserve recent items, or track concept drift. Its job is narrower: maintain a uniform sample without replacement from a stream of unknown length.

If your only goal is “give me a representative size-kk sample of an unknown-length stream,” that neutrality is exactly the point.

When fairness is not the right goal

In continual learning, a replay buffer may not want to mirror the raw stream exactly. If the stream is highly imbalanced or temporally correlated, a plain uniform reservoir preserves those distortions.

A modern example is Kullback-Leibler Reservoir Sampling (KLRS), which updates the replay buffer so that its class proportions stay close to a chosen target distribution instead of merely copying the incoming stream.

Classic reservoir sampling asks for fairness with respect to the stream. KLRS asks for closeness to a target distribution. That is a natural next question once you understand the baseline.


Reservoir sampling is the clean baseline for streaming fairness. But the moment your storage engine has other goals, the word “random” stops being the whole story. That tension gets sharper in databases, where “give me random rows” has to negotiate with pages, scans, and cost models. That is the next post.