The Useful Fiction of Hidden Causes

The dashboard can show a checkout timeout, a retry spike, and a batch job falling behind. It cannot show a column named cause.

A deploy, database saturation, crawler traffic, or a third-party outage might explain those symptoms. Introducing an unobserved state can make the observations easier to model. It does not make that state a discovered fact.

The classic toy example uses documents. You never observe a topic field, only words. A hidden topic can make goal, league, and coach likely in one state while market dominates another. The gap between useful explanation and literal truth is what makes latent variable models interesting.

The main idea

A latent variable model says: the visible data may be easier to explain if we assume there is some unobserved structure underneath it.

The basic setup is a probability distribution over observed variables $x$ and unobserved variables $z$, with parameters $\theta$:

$$p(x, z; \theta).$$

The dataset contains $x$; the model posits $z$. The notation does not tell you whether $z$ is a topic, a cluster, a hidden state, or a continuous coordinate.

A mixture model

Suppose each observation comes from one of $K$ hidden components. Then the joint factorizes as

$$p(x, z) = p(x \mid z)\, p(z),$$

and the marginal over the observed variable is

$$p(x) = \sum_{k=1}^{K} p(x \mid z = k)\, p(z = k).$$

The generative story is simple: first draw a hidden component, then draw an observation from the distribution attached to it.
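
As a worked instance of the marginal, take made-up numbers (none of them come from data): $K = 2$, $p(z = 1) = 0.6$, $p(z = 2) = 0.4$, and a particular observation $x$ with $p(x \mid z = 1) = 0.10$ and $p(x \mid z = 2) = 0.80$. Then

$$p(x) = 0.6 \times 0.10 + 0.4 \times 0.80 = 0.06 + 0.32 = 0.38.$$

The observation is unlikely under the first component and likely under the second, and the marginal weighs the two by how often each component is drawn.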

What the hidden state buys

Sometimes one blunt model of the visible data is worse than several sharper models indexed by a hidden cause.

Instead of trying to model all documents with one giant word distribution, you might assume there is a hidden topic variable that changes the likely vocabulary.

Mixtures, topic models, hidden Markov models, factor models, and many generative models all use some version of this move. The hidden state earns its place through better likelihood, compression, prediction, or downstream usefulness. Otherwise it is only a story the model tells itself.
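
To make "better likelihood" concrete, here is a minimal sketch with made-up numbers. The vocabulary, the two topics, and every probability are hypothetical, chosen only for illustration, and the names `docProb` and `mixtureDocProb` are not from any library. The blunt model uses a single word distribution (the topic-weighted average); the mixture draws one hidden topic per document and shares it across all of that document's words.

```typescript
// Hypothetical vocabulary and probabilities, for illustration only.
const words = ["goal", "coach", "league", "market"];
const components = [
  { name: "sports", prior: 0.6, wordProbs: [0.35, 0.25, 0.30, 0.10] },
  { name: "finance", prior: 0.4, wordProbs: [0.05, 0.05, 0.10, 0.80] },
];

// Probability of a document under one word distribution,
// treating its words as independent draws.
function docProb(doc: string[], wordProbs: number[]): number {
  let prob = 1;
  for (const w of doc) {
    const i = words.indexOf(w);
    prob *= i >= 0 ? (wordProbs[i] ?? 0) : 0;
  }
  return prob;
}

// Blunt model: one word distribution, the topic-weighted average.
const bluntWordProbs = words.map((_, i) =>
  components.reduce((acc, c) => acc + c.prior * (c.wordProbs[i] ?? 0), 0)
);

// Mixture model: sum over the hidden topic, one topic per document.
function mixtureDocProb(doc: string[]): number {
  return components.reduce(
    (acc, c) => acc + c.prior * docProb(doc, c.wordProbs),
    0
  );
}

const doc = ["goal", "coach", "league"];
const bluntProb = docProb(doc, bluntWordProbs);
const mixtureProb = mixtureDocProb(doc);

console.log({ bluntProb, mixtureProb });
```

For this particular sports-heavy document the mixture assigns the higher probability, because it can explain all three words with a single topic. One document proves nothing in general; the comparison that matters is held-out likelihood over a whole dataset.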

A tiny discrete toy

Imagine a hidden topic $z$ that can be either “sports” or “finance.” Once the topic is chosen, the observed word distribution changes.

hidden-topic-sampler.ts

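// Draw one item from a discrete distribution by walking the cumulative sum.
// Assumes items and probs have the same length and probs sum to 1.
function sampleCategorical<T>(
  items: T[],
  probs: number[]
): T {
  const u = Math.random();
  let acc = 0;

  for (let i = 0; i < items.length; i++) {
    acc += probs[i];
    if (u < acc) return items[i];
  }

  // Fallback for floating-point rounding when the cumulative sum ends just below 1.
  return items[items.length - 1];
}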

const topics = ["sports", "finance"] as const;
const topicProbs = [0.6, 0.4];

const wordsByTopic = {
  sports: {
    words: ["goal", "coach", "league", "market"],
    probs: [0.35, 0.25, 0.30, 0.10],
  },
  finance: {
    words: ["goal", "coach", "league", "market"],
    probs: [0.05, 0.05, 0.10, 0.80],
  },
};

// Ancestral sampling: draw the hidden topic first, then a word conditioned on it.
const z = sampleCategorical([...topics], topicProbs);
const x = sampleCategorical(
  wordsByTopic[z].words,
  wordsByTopic[z].probs
);

console.log({ z, x });

This is not a topic model. It is the latent-variable pattern reduced to one draw: sample $z$, then sample $x \mid z$. The word is visible; the selected topic exists only inside the model.

What makes learning harder

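Latent variables can make the model richer, but they also make learning harder. Training has to account for the unobserved $z$ behind each example, either by summing over its possible values or by inferring a distribution over them, while fitting parameters that never receive direct $z$ labels. Even evaluating the posterior $p(z \mid x)$ may be expensive or intractable, because its normalizer is the marginal $p(x)$.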

That inference problem is the price of a richer distribution over the observations.
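
For the two-topic toy above, the posterior is cheap to compute exactly with Bayes' rule. The sketch below reuses the toy's probabilities; the helper name `posteriorOverTopics` is hypothetical. The normalizer has only two terms here. The cost grows when $z$ has many joint configurations or is continuous, which is where approximate inference comes in.

```typescript
// Same probabilities as the toy sampler; the layout is an illustrative choice.
const words = ["goal", "coach", "league", "market"];
const topics = [
  { name: "sports", prior: 0.6, wordProbs: [0.35, 0.25, 0.30, 0.10] },
  { name: "finance", prior: 0.4, wordProbs: [0.05, 0.05, 0.10, 0.80] },
];

// Posterior p(z | x) for a single observed word x.
function posteriorOverTopics(word: string): { name: string; prob: number }[] {
  const i = words.indexOf(word);
  if (i < 0) throw new Error(`unknown word: ${word}`);

  // Unnormalized posterior: p(x | z = k) * p(z = k) for each topic k.
  const joint = topics.map((t) => t.prior * (t.wordProbs[i] ?? 0));

  // Normalizer: the marginal p(x), a sum over every hidden component.
  const marginal = joint.reduce((a, b) => a + b, 0);

  return topics.map((t, k) => ({
    name: t.name,
    prob: (joint[k] ?? 0) / marginal,
  }));
}

console.log(posteriorOverTopics("market"));
console.log(posteriorOverTopics("goal"));
```

Seeing "market" shifts most of the posterior mass to finance, and seeing "goal" shifts it to sports, yet neither word settles the topic with certainty. The posterior is a belief under the model, not an observation of $z$.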

The label is not the thing

Even a well-fitting latent model may not identify one unique hidden explanation. Swap the labels of two mixture components and the observed distribution stays unchanged. A topic I call “finance” may split in another run, merge with business news, or encode a regularity no human would name that way.
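
The label-swap claim can be checked directly. This is a minimal sketch using the toy's two components; the `Component` type and the `marginal` helper are hypothetical names introduced for illustration. Reversing the component order changes which index means what, while the distribution over observed words stays the same.

```typescript
type Component = { prior: number; wordProbs: number[] };

// Marginal p(x) for each word index: sum over components of prior * p(x | z).
function marginal(components: Component[], vocabSize: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < vocabSize; i++) {
    let total = 0;
    for (const c of components) total += c.prior * (c.wordProbs[i] ?? 0);
    out.push(total);
  }
  return out;
}

// Word order: goal, coach, league, market (same numbers as the toy).
const original: Component[] = [
  { prior: 0.6, wordProbs: [0.35, 0.25, 0.30, 0.10] },
  { prior: 0.4, wordProbs: [0.05, 0.05, 0.10, 0.80] },
];

// Same components, labels swapped: index 0 is now the other one.
const swapped: Component[] = original.slice().reverse();

const pOriginal = marginal(original, 4);
const pSwapped = marginal(swapped, 4);

// Compare with a small tolerance to stay safe against rounding.
const sameMarginal = pOriginal.every(
  (p, i) => Math.abs(p - (pSwapped[i] ?? 0)) < 1e-12
);

console.log({ pOriginal, pSwapped, sameMarginal });
```

No amount of observed data can tell the two labelings apart, since they assign identical probability to every observation.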

Latent variables are model coordinates. Treating them as discovered causes requires evidence outside the latent model itself.