Choosing Several Things, Not Just One
Series parts
On this page
A raffle with one winner is easy to explain.
A raffle with ten winners is where people start staring at the code.
If the same person wins twice, nobody says “ah yes, independent sampling with replacement.” They say the raffle is broken. And in a sense, they are right. The problem has changed.
Sampling one item from a finite set is one kind of randomness. Sampling several distinct items is another. The difference sounds small in English, but it changes both the math and the implementation.
With replacement vs without replacement
Suppose your set is .
If you sample with replacement, you draw one item, record it, then put it back before the next draw. Every draw sees the same full set. The sequence is perfectly possible.
If you sample without replacement, you draw one item and remove it from the available pool. After drawing , the next draw is from , and is impossible.
That one change means the draws are no longer independent. Each choice affects the future choices.
What uniform without replacement means
If you sample items uniformly without replacement from a set of size , the cleanest definition is:
Every subset of size should be equally likely.
Sampling 2 items from yields 6 possible unordered outcomes: , , , , , . A uniform without-replacement sampler should give each subset probability .
A useful fact drops out immediately:
Choose 3 distinct items from 10, and each individual item has inclusion probability . That does not mean the events are independent. If one item is already in your sample, there is slightly less room for the others.
Why this shows up all over software
Without-replacement sampling appears whenever duplicates are wasteful, impossible, or misleading: choosing a mini-batch of distinct examples, picking several reviewers from a team, selecting unique experiment units, choosing a subset of candidate moves, sampling rows for inspection, shuffling a playlist without immediate repeats.
Uniform sampling gives you one fair choice. Without-replacement sampling gives you a fair subset.
A simple implementation
The most straightforward way to sample k items without replacement is to partially shuffle the array and keep the first k elements. Here is a TypeScript version using the Fisher-Yates idea:
export function sampleWithoutReplacement<T>( xs: readonly T[], k: number): T[] { if (k < 0 || k > xs.length) { throw new Error("Invalid sample size"); }
const ys = [...xs];
for (let i = 0; i < k; i++) { const j = i + Math.floor(Math.random() * (ys.length - i)); [ys[i], ys[j]] = [ys[j], ys[i]]; }
return ys.slice(0, k);}Each iteration picks one uniformly random element from the remaining suffix and moves it into position. If you shuffled the whole array and returned all of it, that would be an ordinary random shuffle. Here we only do the first k steps, because that is all we need.
That is why the partial shuffle fits this problem so well: it gives you k distinct random items without paying for a full shuffle.
A tempting but awkward alternative
A common first attempt: repeatedly draw uniformly with replacement, insert results into a Set, stop when the set has size k.
This eventually works, but it has a strange runtime profile. When k gets close to n, duplicates become frequent and progress slows. The algorithm spends time proposing outcomes it cannot use, then throwing them away.
The partial-shuffle approach is cleaner because it samples from the right space directly.
A change in perspective
In the first two posts, a probability distribution lived over individual outcomes. Here, the probability distribution lives over subsets.
That shift keeps happening in software: sometimes we choose one item, sometimes many, sometimes order matters, sometimes it does not, sometimes repetition is allowed, sometimes repetition breaks the whole point. Good randomized algorithms are often just careful answers to those questions.
Without-replacement sampling assumes the full set is known in advance. But what if items keep arriving and you do not know when they will stop? You want a fair sample anyway, and you have fixed memory. That is the next post.