How Wrong Can a Random Sample Be?

A fair sampler can still produce an unlucky sample. Fairness tells me the procedure did not favor a row. It does not tell me whether 61% in the sample means 60.9%, 58%, or 40% in the population.

The useful question is numerical: under explicit assumptions, how unlikely is a miss this large? Concentration bounds answer it without pretending that “looks representative” is a measurement.

I will start with the forward problem. If each request fails independently with a known probability, how far can the observed count stray from its expectation? This is narrower than inferring an unknown population rate from one sample, and it gives us a guarantee we can state exactly.

The setup

Imagine $n$ independent yes/no random variables $X_1, X_2, \dots, X_n$, where each $X_i$ is 1 if something happened and 0 otherwise.

This is the right model for many software questions: did this request fail? Did this row match the predicate? Did this randomized test pass? Did this user click?

Let $X = \sum_{i=1}^n X_i$ be the total number of successes, and let $\mu = \mathbb{E}[X]$.

Now the question becomes: how likely is it that $X$ strays far from $\mu$?

A useful Chernoff bound

There are many forms of Chernoff bounds. A compact two-sided version is:

$$
P\!\left(|X-\mu|\ge \delta \mu\right)\le 2e^{-\mu\delta^2/3}, \qquad 0 < \delta < 1.
$$

For independent Bernoulli variables, this form of the bound says that large relative deviations become exponentially unlikely.¹

The bound is not just qualitatively small: it decays exponentially in $\mu$. The more expected mass you have, the tighter the concentration gets.

What it means in plain English

Suppose the expected number of positives is $\mu = 1000$, and being off by 20% or more would count as a failure.

Then the bound says

$$
P(|X-\mu|\ge 0.2\mu)\le 2e^{-1000\cdot 0.04/3} = 2e^{-40/3}.
$$

The result is about $3.24 \times 10^{-6}$: under the bound, at most about 3.24 misses per million repetitions of the whole experiment. The exact binomial tail is smaller still. A rough phrase such as “ten thousand samples should be enough” has become a falsifiable guarantee.
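To see how much slack the bound leaves, the exact two-sided tail can be summed directly. The sketch below assumes one concrete way to reach $\mu = 1000$, namely 10,000 trials with success probability 0.1. The text fixes only $\mu$, so those two values are an assumption, and the function names are illustrative. It works in log space so that the tiny probabilities do not underflow.

```typescript
// Exact two-sided tail P(|X - mu| >= delta * mu) for X ~ Binomial(n, p).
// Assumption: n = 10000 and p = 0.1 are one illustrative way to get mu = 1000.
function exactBinomialTail(n: number, p: number, delta: number): number {
  const mu = n * p;
  const logOdds = Math.log(p) - Math.log(1 - p);
  let logPmf = n * Math.log(1 - p); // log P(X = 0)
  let tail = 0;

  for (let k = 0; k <= n; k++) {
    if (Math.abs(k - mu) >= delta * mu) {
      tail += Math.exp(logPmf);
    }
    // log P(X = k + 1) = log P(X = k) + log((n - k) / (k + 1)) + log(p / (1 - p))
    if (k < n) {
      logPmf += Math.log((n - k) / (k + 1)) + logOdds;
    }
  }
  return tail;
}

// The two-sided Chernoff bound from the text, valid for 0 < delta < 1.
function chernoffUpperBound(mu: number, delta: number): number {
  return 2 * Math.exp(-(mu * delta * delta) / 3);
}

const exactTail = exactBinomialTail(10000, 0.1, 0.2);
const upperBound = chernoffUpperBound(1000, 0.2);

console.log("exact binomial tail:", exactTail);
console.log("Chernoff bound:     ", upperBound);
```

Under these assumed values the exact tail comes out several orders of magnitude below the bound. That gap is the price of a formula that needs only $\mu$ and $\delta$.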

The constant $1/3$ is less interesting than the shape. Double the expected evidence while holding the relative-error threshold fixed and the exponent doubles. The failure bound does not halve: its exponential factor squares, so $2e^{-40/3}$ becomes $2e^{-80/3}$.
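The effect of doubling $\mu$ is easy to check numerically. This small sketch evaluates the exponential factor $e^{-\mu\delta^2/3}$ at $\mu = 1000$ and $\mu = 2000$ with $\delta = 0.2$. The leading constant 2 is left out so that the squaring identity is exact, and the helper name is illustrative.

```typescript
// Exponential factor of the two-sided bound, without the leading constant 2.
function chernoffFactor(mu: number, relativeError: number): number {
  return Math.exp(-(mu * relativeError * relativeError) / 3);
}

const relativeError = 0.2;
const factorAt1000 = chernoffFactor(1000, relativeError);
const factorAt2000 = chernoffFactor(2000, relativeError);

console.log("factor at mu = 1000:", factorAt1000);
console.log("factor at mu = 2000:", factorAt2000);
console.log("square of the first:", factorAt1000 * factorAt1000);
```

The last two printed values should agree up to floating-point rounding.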

From intuition to a guarantee

Without a concentration bound, sampling often gets explained vaguely: “it should be close,” “usually this works,” “the sample looks representative.”

Chernoff bounds replace that with an actual sentence:

If the sample is built from many independent Bernoulli-style trials, the probability of a large miss drops exponentially fast.

The assumptions are doing real work. Independence is not decorative, and a biased sampling procedure does not become sound because a clean inequality appears afterward.

This is also a forward guarantee around a known expectation, not a confidence interval reverse-engineered from one observed sample. Estimating an unknown population proportion requires another step. I am keeping the narrower statement because it is the one the formula actually proves.
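Within that forward setting, the bound can also be turned around to size an experiment. Setting $2e^{-\mu\delta^2/3} \le \varepsilon$ and solving for $\mu$ gives $\mu \ge 3\ln(2/\varepsilon)/\delta^2$. Here is a minimal sketch, where $\varepsilon$ (the tolerated failure probability) and the function name are introduced only for illustration:

```typescript
// Smallest expected count mu for which 2 * exp(-mu * delta^2 / 3) <= epsilon.
// Only meaningful in the regime of the bound: 0 < delta < 1 and 0 < epsilon < 1.
function requiredExpectedCount(delta: number, epsilon: number): number {
  if (!(delta > 0 && delta < 1)) {
    throw new RangeError("delta must lie strictly between 0 and 1");
  }
  if (!(epsilon > 0 && epsilon < 1)) {
    throw new RangeError("epsilon must lie strictly between 0 and 1");
  }
  return (3 * Math.log(2 / epsilon)) / (delta * delta);
}

// Example: a 20% relative miss tolerated at most once per million runs.
const neededMu = requiredExpectedCount(0.2, 1e-6);
console.log("required expected count:", neededMu); // roughly 1088
```

The dependence on $\varepsilon$ is only logarithmic, while halving $\delta$ quadruples the required $\mu$.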

A sample can estimate, not just represent

Up to this point, sampling has mostly meant choosing things fairly. But there is a broader reason randomness keeps sneaking into software: a random sample can estimate a quantity that is too expensive to compute exactly.

This is the Monte Carlo idea in its most basic form. If $X_1,\dots,X_N$ are independent samples from some distribution and you care about $\mathbb{E}[f(X)]$, the Monte Carlo estimator is just the sample average:

$$
\hat{\mu}_N = \frac{1}{N}\sum_{i=1}^N f(X_i).
$$

For independent samples and finite variance $\sigma^2 = \operatorname{Var}(f(X))$, that estimator is unbiased and its root-mean-square error is

$$
\sqrt{\mathbb{E}\!\left[(\hat{\mu}_N-\mathbb{E}[f(X)])^2\right]} = \frac{\sigma}{\sqrt{N}}.
$$

The square-root rate is humbling: to cut error by ten, you need one hundred times as many samples. It is still useful because the rate does not depend directly on the dimension of the integration domain, which is why Monte Carlo survives in problems where grids explode.
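Here is a minimal sketch of the estimator, assuming $f(x) = x^2$ with $X$ uniform on $[0, 1)$, so the true value is $1/3$. The seeded generator (a small 32-bit PRNG commonly circulated as mulberry32) is an assumption, added only to make runs repeatable. All names are illustrative.

```typescript
// Small seeded PRNG (commonly known as mulberry32); returns floats in [0, 1).
// Assumption: good enough for a demo, not for cryptography or serious simulation.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Sample average of f over sampleCount independent draws.
function monteCarloMean(
  f: (x: number) => number,
  draw: () => number,
  sampleCount: number
): number {
  let sum = 0;
  for (let i = 0; i < sampleCount; i++) {
    sum += f(draw());
  }
  return sum / sampleCount;
}

const square = (x: number): number => x * x;
const trueValue = 1 / 3;

// Each row uses 100 times more samples than the previous one.
for (const sampleCount of [100, 10000, 1000000]) {
  const estimate = monteCarloMean(square, mulberry32(42), sampleCount);
  console.log(sampleCount, estimate, Math.abs(estimate - trueValue));
}
```

Because each row uses 100 times more samples, the error typically shrinks by about a factor of ten from one row to the next. Any single run can deviate from that pattern, since the $\sigma/\sqrt{N}$ rate describes the root-mean-square error, not every individual draw.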

A tiny simulation

Here is a small experiment that estimates how often a Bernoulli sum deviates from its mean by at least a chosen relative threshold:

estimate-tail-probability.ts

// One Bernoulli(p) draw: 1 with probability p, otherwise 0.
function bernoulli(p: number, rng: () => number): number {
  return rng() < p ? 1 : 0;
}

// Sum of n independent Bernoulli(p) draws, i.e. one realization of X.
function trial(n: number, p: number, rng: () => number): number {
  let x = 0;
  for (let i = 0; i < n; i++) {
    x += bernoulli(p, rng);
  }
  return x;
}

// Fraction of repetitions in which |X - mu| >= delta * mu.
function estimateTailProbability(
  n: number,
  p: number,
  delta: number,
  repetitions: number,
  rng: () => number
): number {
  const mu = n * p;
  let bad = 0;

  for (let r = 0; r < repetitions; r++) {
    const x = trial(n, p, rng);
    if (Math.abs(x - mu) >= delta * mu) {
      bad += 1;
    }
  }

  return bad / repetitions;
}

// Two-sided Chernoff bound 2 * exp(-mu * delta^2 / 3), valid for 0 < delta < 1.
function chernoffBound(
  n: number,
  p: number,
  delta: number
): number {
  const mu = n * p;
  return 2 * Math.exp(-(mu * delta * delta) / 3);
}

const n = 1000;
const p = 0.1;
const delta = 0.3;
const reps = 50000;

console.log(
  "empirical tail:",
  estimateTailProbability(n, p, delta, reps, Math.random)
);
console.log("Chernoff bound:", chernoffBound(n, p, delta));

The code uses a smaller example, $\mu=100$ and $\delta=0.3$, so 50,000 repetitions actually expose some tail events. The bound is deliberately loose: it is a reusable guarantee, not the exact binomial probability. Running the earlier $\mu = 1000$, $\delta = 0.2$ example, whose miss probability is bounded by a few in a million, would usually print zero observed misses and teach almost nothing.

Where the assumptions break

Indicator variables map neatly onto software events. “This row matches,” “this request failed,” and “this feature flag fired” can each become a Bernoulli variable when the underlying trials satisfy the model.

And if you use random draws to estimate a quantity instead of computing it exactly, you are doing Monte Carlo whether or not you call it that.

Real systems bring correlation, drift, survivorship bias, and retries that arrive in bursts. The theorem has not failed when those assumptions fail; the model has stopped describing the system.
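A minimal sketch of that failure mode follows. It assumes a deliberately crude correlation model that is not taken from the discussion above: outcomes arrive in bursts of 20 that all share one coin flip. Each indicator still has mean $p$, so $\mu$ is unchanged, but independence is gone. The burst model, the seeded generator (commonly circulated as mulberry32), and every name here are illustrative.

```typescript
// Small seeded PRNG (commonly known as mulberry32); returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Sum of n indicators where every burst of burstSize trials shares one Bernoulli(p) outcome.
function burstyTrial(n: number, p: number, burstSize: number, rng: () => number): number {
  let x = 0;
  for (let start = 0; start < n; start += burstSize) {
    const size = Math.min(burstSize, n - start);
    if (rng() < p) {
      x += size; // the whole burst succeeds together
    }
  }
  return x;
}

// Fraction of repetitions in which the bursty sum misses mu by at least delta * mu.
function burstyTailEstimate(
  n: number,
  p: number,
  delta: number,
  burstSize: number,
  repetitions: number,
  rng: () => number
): number {
  const mu = n * p;
  let bad = 0;
  for (let r = 0; r < repetitions; r++) {
    if (Math.abs(burstyTrial(n, p, burstSize, rng) - mu) >= delta * mu) {
      bad += 1;
    }
  }
  return bad / repetitions;
}

// Same n, p, and delta as the independent simulation: mu = 100, delta = 0.3.
const independentBound = 2 * Math.exp(-(1000 * 0.1 * 0.3 * 0.3) / 3);
const burstyTail = burstyTailEstimate(1000, 0.1, 0.3, 20, 20000, mulberry32(1));

console.log("bound for independent trials:", independentBound);
console.log("empirical tail with bursts:  ", burstyTail);
```

Under this burst model the estimated tail lands well above the bound for independent trials. The inequality has not been violated. It was proved for a model that these correlated trials do not satisfy.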