Skip to main content

Why Random Rows in SQL Are Weirdly Expensive

· 5 min read
Series parts
  1. Part 5Why Random Rows in SQL Are Weirdly Expensive
On this page

Asking a database for random rows feels like it should be boring.

You are not asking for a recursive query. You are not doing graph analytics. You are not hand-optimizing a terrible join. You just want, say, 100 random users from a table.

It turns out to be hard enough to teach a useful lesson: in software, “random sample” is not just a probability problem. It is also a storage problem.

Rows live on disk pages. Tables are filtered and scanned in particular orders. Some sampling methods work at the row level, some at the block level, and some give you approximate sizes rather than exact counts. Once the data has a physical layout, randomness stops being an abstract ideal and starts negotiating with the engine.

The naive query

The first thing many people write is some variation of:

SELECT *
FROM users
ORDER BY random()
LIMIT 100;

This is a perfectly understandable query. It is also a hint that you are asking the engine to assign a random value to every candidate and then take the top few.

On a small table, that is fine. On a large table, it is exactly the kind of innocent-looking query that can become much more expensive than you meant. If you randomize every row and only keep a few, the engine touches a lot more of the table than the final answer suggests.

So the interesting question is not “can SQL do random rows?” It obviously can. The interesting question is:

What kind of randomness do you actually need, and what are you willing to pay for it?

Different sampling goals

Modern SQL engines often expose sampling features that make the trade-offs clearer. At a conceptual level, there are at least three different questions you might be asking:

  • Give me an approximately random fraction of the table.
  • Give me exactly k rows.
  • Give me a sample that is reproducible.

Those are different requirements, and each one pushes you toward a different implementation.

Bernoulli-style row sampling

A natural model is row-level independent sampling: keep each row with probability pp.

This is conceptually clean, and it leads to one nice formula immediately:

E[S]=pn.E[|S|] = pn.

If a table has n rows and each row is independently kept with probability p, then the expected sample size is pn. Expected size is not exact size. If you need roughly 1% of a huge table, a Bernoulli-style sample is a good fit. If you need exactly 100 rows, it is not the same problem.

Block-level sampling

A database may also sample at the page or block level rather than the row level. That can be much faster, because the engine operates closer to the physical storage layout.

But it also means the sample quality is partly shaped by that layout. If related rows happen to live near each other on disk, a block-level sample can overrepresent local neighborhoods of the data.

That is the pattern this post emphasizes:

  • Row-level randomness is closer to the probability story.
  • Block-level randomness is closer to the storage story.

Neither is universally right. They answer slightly different questions.

Reproducibility matters

There is another detail that sounds boring until you need it: reproducibility.

In analysis or debugging, “give me a random sample” often really means:

Give me a random-looking sample that I can get again tomorrow.

That is a very different request from “surprise me every time.” So one useful axis in database sampling is not just speed or representativeness, but whether the sampling rule can be seeded or repeated in a stable way.

Randomness and reproducibility are not enemies. You can have both, if you decide which one you care about.

The SQL landscape

-- Naive: easy to write, often costly on large tables
SELECT *
FROM users
ORDER BY random()
LIMIT 100;
-- Approximate fraction, conceptually row-level
SELECT *
FROM users
TABLESAMPLE BERNOULLI (1);
-- Approximate fraction, closer to physical storage
SELECT *
FROM users
TABLESAMPLE SYSTEM (1);
-- Reproducible, if the engine supports it
SELECT *
FROM users
TABLESAMPLE SYSTEM (1)
REPEATABLE (42);

These examples show how quickly one request turns into several distinct trade-offs.

What this has to do with sampling

This is where sampling meets engineering constraints directly.

In the previous posts, sampling was a probability operation on abstract sets or streams. Here, the set has physical geometry. Rows are grouped into pages. Pages are stored in a particular order. The sampling method has to respect the engine’s access patterns, or pay the price.

That tension shows up everywhere real systems use randomness. The probabilistic goal may be clear, but the implementation still has to work with the engine’s access patterns and storage layout.


A database sample brings you to the most natural skeptical question: “okay, but how much can I trust this?” That question deserves a mathematical answer, not just a shrug. That is the next post.