How Much Can You Trust a Sample?
Series parts
On this page
At some point in every conversation about sampling, someone asks the harder question.
Not “how do we pick the rows?” Not “how fast is the query?” Not “can we make this reproducible?”
They ask:
Sure, but how wrong could we be?
That is the right question.
Up to now, the series has mostly been about how to choose fairly: from a set, from a weighted distribution, from a stream, from a table. But as soon as a sample becomes evidence for some larger claim, fairness is only half the story. The other half is reliability.
If I sample 1,000 events and 61% of them have a property, how much confidence should I have that the full population is also near 61%?
This is where concentration bounds enter. And among them, Chernoff bounds are some of the most useful.
The setup
Imagine n independent yes/no random variables , where each is 1 if something happened and 0 otherwise.
This is the right model for many software questions: did this request fail? Did this row match the predicate? Did this randomized test pass? Did this user click?
Let be the total number of successes, and let .
Now the question becomes: how likely is it that strays far from ?
A useful Chernoff bound
There are many forms of Chernoff bounds. A compact two-sided version is:
This formula says something simple and powerful: if your random variable is a sum of many independent tiny contributions, then large relative deviations become exponentially unlikely.
The decay is exponential in , not just qualitatively small. The more expected mass you have, the tighter the concentration gets.
What it means in plain English
Suppose your sample is large enough that the expected number of positives is . Suppose you care about relative error , meaning “being off by 10% or more.”
Then the bound says
That is already fairly small. And as grows, it drops quickly.
The important idea is not the exact constant . The important idea is the shape: more independent evidence, smaller relative fluctuations, exponentially better tail behavior. That is why sampling can scale from “toy estimate” to “defensible approximation.”
From intuition to a guarantee
Without a concentration bound, sampling often gets explained vaguely: “it should be close,” “usually this works,” “the sample looks representative.”
Chernoff bounds replace that with an actual sentence:
If the sample is built from many independent Bernoulli-style trials, the probability of a large miss drops exponentially fast.
You do not need every proof detail to get the practical point: repeated independent evidence compounds very nicely.
A sample can estimate, not just represent
Up to this point, sampling has mostly meant choosing things fairly. But there is a broader reason randomness keeps sneaking into software: a random sample can estimate a quantity that is too expensive to compute exactly.
This is the Monte Carlo idea in its most basic form. If are independent samples from some distribution and you care about , the Monte Carlo estimator is just the sample average:
That estimator is unbiased, and its variance shrinks like . Under the Central Limit Theorem, the typical error scale behaves like
That rate is slow, but still useful enough to power simulation, rendering, Bayesian computation, reinforcement learning, and finance.
A tiny simulation
Here is a small experiment that estimates how often a Bernoulli sum deviates by more than a chosen relative threshold:
function bernoulli(p: number, rng: () => number): number { return rng() < p ? 1 : 0;}
function trial(n: number, p: number, rng: () => number): number { let x = 0; for (let i = 0; i < n; i++) { x += bernoulli(p, rng); } return x;}
function estimateTailProbability( n: number, p: number, delta: number, repetitions: number): number { const mu = n * p; let bad = 0;
for (let r = 0; r < repetitions; r++) { const x = trial(n, p, Math.random); if (Math.abs(x - mu) >= delta * mu) { bad += 1; } }
return bad / repetitions;}
function chernoffBound( n: number, p: number, delta: number): number { const mu = n * p; return 2 * Math.exp(-(mu * delta * delta) / 3);}
const n = 1000;const p = 0.3;const delta = 0.1;const reps = 10000;
console.log( "empirical tail:", estimateTailProbability(n, p, delta, reps));console.log("Chernoff bound:", chernoffBound(n, p, delta));The empirical number will be small. The Chernoff bound will be larger but still reasonable. The gap is the price of a universal guarantee: the bound holds for any Bernoulli sum without needing to know its exact distribution.
What this has to do with databases and systems
A lot. If you sample rows independently with some probability and compute a simple proportion, you are exactly in the world where indicator variables make sense. “This row matches” is a Bernoulli variable. “This request failed” is a Bernoulli variable. “This feature flag fired” is a Bernoulli variable.
And if you use random draws to estimate a quantity instead of computing it exactly, you are doing Monte Carlo whether or not you call it that.
Real systems have correlation, drift, and all kinds of bad manners. But this is the first mathematical model that explains why sampling is not automatically reckless.
One caution
Chernoff bounds need assumptions. Independence matters. Bernoulli structure matters. If your data is heavily correlated or your sample is biased before you start counting, the bound does not magically rescue you.
That is not a flaw in the theorem. It is a reminder that probability bounds and sampling design are partners, not substitutes.
Once you see that random sampling can stand in for an expensive exact computation, a lot of later algorithms stop looking like hacks and start looking like disciplined approximations. But first, there is another compromise: tiny data structures that lie just enough. That is the next post.