A Latent Space Is Not Automatically a Place

An autoencoder’s latent space looks like a place only because we draw it as one.

The training objective asks the encoder for codes that let the decoder reconstruct its inputs. It does not ask what should happen at an arbitrary coordinate nobody encoded. For reconstruction, that contract can be enough. For generation, it leaves no principled answer to the first question: where should a new latent point come from?

A variational autoencoder answers by specifying a probabilistic generative model and learning approximate inference for it.¹

What an ordinary autoencoder gives you

A basic autoencoder learns a hidden code $h = f(x)$ that is sufficient to reconstruct the input through a decoder $\hat{x} = g(h)$.

That is a perfectly good objective. It does not force the hidden codes to occupy the latent space in a known distribution. The encoder can place training examples on separated regions and leave large areas unused.

An arbitrary point can therefore decode badly without the autoencoder failing. We asked for reconstruction and silently hoped for geography.
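A minimal sketch of that contract, assuming flattened 784-dimensional inputs and a mean-squared-error reconstruction loss (the class name `PlainAutoencoder`, the layer sizes, and the random stand-in batch are illustrative choices, not taken from any particular implementation):

```python
import torch
import torch.nn as nn

class PlainAutoencoder(nn.Module):
    """Deterministic autoencoder: h = f(x), x_hat = g(h)."""

    def __init__(self, d_in=784, d_hidden=256, d_code=2):
        super().__init__()
        # f: input -> code
        self.f = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_code)
        )
        # g: code -> reconstruction in [0, 1]
        self.g = nn.Sequential(
            nn.Linear(d_code, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_in), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.f(x)
        return self.g(h), h

def reconstruction_loss(model, x):
    x_hat, _ = model(x)
    # Only x and x_hat appear here; no term constrains where the codes h land.
    return nn.functional.mse_loss(x_hat, x)

torch.manual_seed(0)
model = PlainAutoencoder()
x = torch.rand(8, 784)  # stand-in batch with values in [0, 1)
loss = reconstruction_loss(model, x)
```

The loss never mentions the distribution of `h`, which is why the objective alone says nothing about what an unvisited code should decode to.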

What a VAE changes

A variational autoencoder keeps the encoder-decoder flavor but adds a probabilistic latent-variable model underneath.

The generative picture becomes:

Assume a latent variable $z$.
Assume a prior $p(z)$, often $\mathcal{N}(0, I)$.
Generate $x$ from $p(x \mid z)$.
Learn an approximate posterior $q(z \mid x)$.

A VAE is not just “compress and reconstruct.” It is “learn a generative model with hidden variables, and learn an approximate posterior over those hidden variables.”
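In symbols, and using only the distributions named above, the generative picture amounts to a joint density, a marginal that is generally intractable to compute, and an approximation to the true posterior:

$$
p(x, z) = p(z)\,p(x \mid z), \qquad p(x) = \int p(z)\,p(x \mid z)\,dz, \qquad q(z \mid x) \approx p(z \mid x) = \frac{p(z)\,p(x \mid z)}{p(x)}.
$$

The integral over $z$ is what makes exact inference hard when $p(x \mid z)$ is a neural network, and it is the reason an approximate posterior is learned rather than computed.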

The practical intuition

A plain autoencoder learns codes. A VAE chooses a prior, commonly $p(z)=\mathcal{N}(0,I)$, and trains a decoder as part of a model that should assign probability to observations generated from prior samples.
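Generation from the prior can be sketched in a few lines. The decoder here is an untrained, hypothetical stand-in, and the Bernoulli likelihood is an assumption made for illustration; with a trained VAE decoder the same steps would apply.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_latent = 2

# Hypothetical stand-in decoder (untrained); a trained VAE decoder would go here.
decoder = nn.Sequential(
    nn.Linear(d_latent, 256),
    nn.ReLU(),
    nn.Linear(256, 784),
    nn.Sigmoid(),
)

with torch.no_grad():
    z = torch.randn(16, d_latent)   # z ~ N(0, I), the chosen prior
    probs = decoder(z)              # parameters of p(x | z): Bernoulli means (assumed)
    x = torch.bernoulli(probs)      # optional: draw x ~ p(x | z)
```

The point is that the starting distribution for `z` is stated by the model rather than guessed from wherever an encoder happened to put its codes.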

People often summarize the KL term as “making the latent space smooth.” That picture is useful and incomplete. Optimizing the ELBO does not guarantee that every interpolation is meaningful, that the aggregated posterior exactly matches the prior, or that the decoder uses the latent variable at all.

The trick that makes training work

The encoder in a VAE does not output one deterministic code. It outputs the parameters of a distribution, often a mean and variance for a Gaussian approximate posterior.

That leads to the famous move: the reparameterization trick.

Instead of sampling $z$ directly in a way that blocks gradients, you write the sample as a deterministic function of encoder outputs and external noise:

$$
z = \mu(x) + \sigma(x)\odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I).
$$

This produces a low-variance pathwise gradient estimator: randomness stays in $\varepsilon$, while the sample remains a differentiable function of $\mu$ and $\sigma$. Other gradient estimators exist, so reparameterization is not the only conceivable way to train a stochastic model. It is the move that made this continuous-latent setup practical with ordinary backpropagation.

A small PyTorch sketch keeps the mechanics visible. It defines the encoder, the decoder, and the sampling step; the training loop is omitted.

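import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps x to the parameters (mean, log-variance) of a diagonal Gaussian q(z | x)."""

    def __init__(self, d_in=784, d_hidden=256, d_latent=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
        )
        # Two heads share the hidden features: one for the mean, one for log(sigma^2).
        self.mu = nn.Linear(d_hidden, d_latent)
        self.logvar = nn.Linear(d_hidden, d_latent)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Maps z to values in (0, 1), one per output dimension."""

    def __init__(self, d_latent=2, d_hidden=256, d_out=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_out),
            nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

def reparameterize(mu, logvar):
    # logvar = log(sigma^2), so sigma = exp(0.5 * logvar) is always positive.
    std = torch.exp(0.5 * logvar)
    # The noise is drawn independently of mu and logvar.
    eps = torch.randn_like(std)
    # z = mu + sigma * eps is differentiable with respect to mu and logvar.
    return mu + std * eps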

That reparameterize function contains the key step.
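A quick way to see what the function buys is to check that gradients actually reach the encoder outputs. The sketch below repeats `reparameterize` so it runs on its own, uses an arbitrary scalar function of the sample as a stand-in loss, and contrasts it with `torch.distributions.Normal.sample()`, which returns a sample detached from the graph. The tensor shapes and the stand-in loss are illustrative assumptions.

```python
import torch
from torch.distributions import Normal

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

torch.manual_seed(0)
mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)

# Reparameterized sample: part of the autograd graph.
z = reparameterize(mu, logvar)
z.pow(2).sum().backward()  # arbitrary scalar function of z, standing in for a loss

grads_reach_mu = mu.grad is not None
grads_reach_logvar = logvar.grad is not None

# Direct sampling: Normal.sample() draws without tracking gradients.
z_direct = Normal(mu, torch.exp(0.5 * logvar)).sample()
direct_is_differentiable = z_direct.requires_grad
```

With these inputs `grads_reach_mu` and `grads_reach_logvar` are both `True`, while `direct_is_differentiable` is `False`. `Normal.rsample()` is the library's reparameterized counterpart to `sample()`.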

The objective

The training objective is often written as an evidence lower bound, or ELBO:

$$
\mathcal{L}(x)= E_{q(z\mid x)}[\log p(x\mid z)] -D_{\mathrm{KL}}(q(z\mid x)\,\|\,p(z)).
$$

It is a lower bound because $\mathcal{L}(x) \le \log p(x)$. Maximizing it with respect to the decoder raises a tractable lower bound on the data log-likelihood; maximizing it with respect to the encoder tightens that bound by moving $q(z \mid x)$ toward the true posterior.
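The inequality follows from a short identity. Since $\log p(x)$ does not depend on $z$, it can be written as an expectation under $q(z \mid x)$ and split using $p(x) = p(x, z)/p(z \mid x)$:

$$
\begin{aligned}
\log p(x) &= E_{q(z\mid x)}\!\left[\log \frac{p(x, z)}{q(z\mid x)}\right] + E_{q(z\mid x)}\!\left[\log \frac{q(z\mid x)}{p(z\mid x)}\right] \\
&= \underbrace{E_{q(z\mid x)}[\log p(x\mid z)] - D_{\mathrm{KL}}(q(z\mid x)\,\|\,p(z))}_{\mathcal{L}(x)} + D_{\mathrm{KL}}(q(z\mid x)\,\|\,p(z\mid x)).
\end{aligned}
$$

The last term is a KL divergence and therefore non-negative, so $\mathcal{L}(x) \le \log p(x)$, and the gap is exactly how far the approximate posterior is from the true one.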

The formula is doing two jobs:

Reconstruct the input well.
Keep the approximate posterior close to the chosen prior.

If the reconstruction term dominates, the approximate posterior can drift into regions that are hard to reach from the prior. If the KL term dominates, the decoder may ignore $z$ and the approximate posterior may collapse toward the prior. The objective exposes the tension; it does not balance the two automatically.
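One way to keep that tension visible is to compute the two terms separately and log both. The sketch below assumes a Bernoulli likelihood (so the reconstruction term is a binary cross-entropy), a diagonal Gaussian approximate posterior, a standard normal prior, and a single sample of $z$ per input; the function name `negative_elbo_terms` and the random stand-in tensors are hypothetical.

```python
import torch
import torch.nn.functional as F

def negative_elbo_terms(x, x_probs, mu, logvar):
    """Return (reconstruction, kl), each averaged over the batch."""
    batch = x.size(0)
    # -log p(x | z) under an assumed Bernoulli likelihood, summed over dimensions.
    recon = F.binary_cross_entropy(x_probs, x, reduction="sum") / batch
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return recon, kl

torch.manual_seed(0)
x = torch.rand(8, 784).round()                   # stand-in binary inputs
x_probs = torch.rand(8, 784).clamp(0.01, 0.99)   # stand-in decoder outputs
mu = torch.randn(8, 2)                           # stand-in encoder means
logvar = torch.randn(8, 2)                       # stand-in encoder log-variances

recon, kl = negative_elbo_terms(x, x_probs, mu, logvar)
loss = recon + kl  # negative ELBO, the quantity to minimize
```

A KL term that stays near zero throughout training is a common sign that the approximate posterior has collapsed toward the prior; a large KL alongside a small reconstruction term suggests the opposite drift.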

A VAE gives generation an explicit starting distribution, but its guarantees are probabilistic and objective-dependent, not cartographic. The map metaphor helps only while I remember that the loss, not the diagram, defines the territory.