How to Keep a Fair Sample of a Stream
Series parts
On this page
A list is polite.
A list sits there and lets you count it. You can ask how long it is. You can shuffle it. You can index into it.
A stream is not like that.
A stream keeps arriving after you have run out of memory. Logs keep coming in. Events keep firing. Requests keep landing. You do not get to pause the world, count everything, and then decide what your sample should have been.
And yet the question remains perfectly reasonable:
Can I keep a fair sample of everything I have seen so far, without storing everything?
Yes. The algorithm is called reservoir sampling.
The problem in one sentence
Items arrive one by one: . You do not know how many there will be. You want to keep exactly k of them in memory, so that after seeing n items, every item seen so far has probability
If you have seen 10,000 log lines and your reservoir size is 100, each line should have a 1% chance of being in memory right now. Not just the recent ones. Not just the lucky early ones. All of them.
The algorithm
The simplest version is often called Algorithm R.
- Put the first
kitems directly into the reservoir. - When the -st item arrives, keep it with probability .
- If you keep it, evict one of the current
kreservoir items uniformly at random.
That is the whole algorithm.
export function reservoirSample<T>( stream: Iterable<T>, k: number): T[] { const reservoir: T[] = []; let seen = 0;
for (const item of stream) { seen += 1;
if (reservoir.length < k) { reservoir.push(item); continue; }
const j = Math.floor(Math.random() * seen);
if (j < k) { reservoir[j] = item; } }
return reservoir;}The key line is:
const j = Math.floor(Math.random() * seen);If j < k, the new item enters the reservoir. Otherwise it is ignored. That single line silently implements the “keep with probability / seen” step.
Why it works
The proof is a short induction.
Take the moment when you have seen items and your reservoir is already fair. Now item arrives. It should be included with probability , and that is exactly what the algorithm does.
What about an older item already in the reservoir? It was there with probability . Once the new item arrives, it survives unless two things happen: the new item is accepted (probability ), and this specific old item is chosen for eviction (probability ).
The probability of getting kicked out is
The survival probability is , so the old item remains with probability
The old items and the new item all end up with the same inclusion probability.
Step through items one at a time above, or let it auto-play. Watch the reservoir fill up greedily at first, then become increasingly selective. Switch to the fairness view and run thousands of simulations; the inclusion histogram flattens, proving every position has equal probability.
Why this matters
Reservoir sampling is the first algorithm in this series that feels unmistakably like software rather than textbook probability. It solves a constraint you actually run into: you cannot store the whole stream, you do not know how long it will be, and you still want a sample you can defend.
That comes up in telemetry, observability, online analytics, rate-limited inspection, and any long-running process where “just keep everything” is not really a plan.
The algorithm starts greedy, then becomes increasingly selective. Early items enter easily because the reservoir is still filling. Later items enter only with small probability, and even then they rarely displace older items.
A reservoir is fair, not balanced
Reservoir sampling is not trying to make the buffer look aesthetically balanced, upweight rare classes, preserve recent items, or track concept drift. Its job is narrower: maintain a uniform sample without replacement from a stream of unknown length.
If your only goal is “give me a representative size- sample of an unknown-length stream,” that neutrality is exactly the point.
When fairness is not the right goal
In continual learning, a replay buffer may not want to mirror the raw stream exactly. If the stream is highly imbalanced or temporally correlated, a plain uniform reservoir preserves those distortions.
A modern example is Kullback-Leibler Reservoir Sampling (KLRS), which updates the replay buffer so that its class proportions stay close to a chosen target distribution instead of merely copying the incoming stream.
Classic reservoir sampling asks for fairness with respect to the stream. KLRS asks for closeness to a target distribution. That is a natural next question once you understand the baseline.
Reservoir sampling is the clean baseline for streaming fairness. But the moment your storage engine has other goals, the word “random” stops being the whole story. That tension gets sharper in databases, where “give me random rows” has to negotiate with pages, scans, and cost models. That is the next post.