When a Latent Space Becomes Sampleable
Series parts
On this page
A plain autoencoder sounds like it should already solve the problem.
You take an input , compress it into a hidden code , then reconstruct from . Hidden representation learned. Job finished.
And for some jobs, that is true. But there is a catch: a code that is useful for reconstruction is not automatically a code you can sample from.
That distinction is the whole reason variational autoencoders exist.
What an ordinary autoencoder gives you
A basic autoencoder learns a hidden code that is sufficient to reconstruct the input through a decoder .
Perfectly good objective. But it does not force the hidden codes to occupy the latent space in any especially tidy way. Two nearby codes may decode into related things, or they may not. Large regions of latent space may never be used at all.
So if you take a random point from the hidden space of a plain autoencoder and decode it, you may get garbage. The model was never asked to make arbitrary latent points meaningful.
What a VAE changes
A variational autoencoder keeps the encoder-decoder flavor but adds a probabilistic latent-variable model underneath.
The generative picture becomes:
- Assume a latent variable .
- Assume a prior , often .
- Generate from .
- Learn an approximate posterior .
A VAE is not just “compress and reconstruct.” It is “learn a generative model with hidden variables, and learn an approximate posterior over those hidden variables.”
The main intuition
A plain autoencoder learns a code. A VAE tries to shape that code space so it matches a known prior distribution, letting you sample latent points from the prior and decode them into plausible outputs.
The trick that makes training work
The encoder in a VAE does not output one deterministic code. It outputs the parameters of a distribution, often a mean and variance for a Gaussian approximate posterior.
That leads to the famous move: the reparameterization trick.
Instead of sampling directly in a way that blocks gradients, you write the sample as a deterministic function of encoder outputs and external noise:
That one step is what makes the model trainable with backpropagation. Randomness stays, but it is pushed into , while the path from loss to and remains differentiable.
Here, a small PyTorch example is the clearest option. This is training code, and Python keeps the mechanics visible.
import torchimport torch.nn as nn
class Encoder(nn.Module): def __init__(self, d_in=784, d_hidden=256, d_latent=2): super().__init__() self.net = nn.Sequential( nn.Linear(d_in, d_hidden), nn.ReLU(), ) self.mu = nn.Linear(d_hidden, d_latent) self.logvar = nn.Linear(d_hidden, d_latent)
def forward(self, x): h = self.net(x) return self.mu(h), self.logvar(h)
class Decoder(nn.Module): def __init__(self, d_latent=2, d_hidden=256, d_out=784): super().__init__() self.net = nn.Sequential( nn.Linear(d_latent, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out), nn.Sigmoid(), )
def forward(self, z): return self.net(z)
def reparameterize(mu, logvar): std = torch.exp(0.5 * logvar) eps = torch.randn_like(std) return mu + std * epsThat reparameterize function contains the key step.
The objective
The training objective is often written as an evidence lower bound, or ELBO. The intuition is enough:
- Reconstruct the input well.
- Keep the approximate posterior close to the chosen prior.
Those two pressures together make the latent space navigable. If reconstruction pressure acted alone, the model could memorize awkward hidden codes. If the prior pressure acted alone, the latent space would be neat but useless. The point is to balance the two.
Why this matters for the series
This is the first post where sampling enters a learned hidden world in a direct way. Earlier posts sampled from finite sets, weighted distributions, streams, probabilistic summaries, search neighborhoods, classifier outputs. Now we are sampling from a learned latent space.
A VAE is what happens when latent variables stop being only a modeling convenience and become an actual generative coordinate system.
A VAE generates by sampling a latent vector and decoding it all at once. A language model generates differently: one discrete choice after another. That is the next post.