The Free Transformer

François Fleuret – 2025

Making latent variables explicit, i.e. conditioning on them directly, substantially simplifies probabilistic modeling. Modeling a density autoregressively, which is to say conditioning only on past values without ever conditioning on latent variables explicitly, introduces a great deal of complexity.

Why does this happen? (Hypothesis) When you don’t explicitly include the relevant latent variables, you implicitly infer what they might have been from the realized past values of the process you’re modeling. You look at those past values and form an educated guess about the value the latent variable took on. The problem is that computing this educated guess precisely can be extremely complicated.
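To make this concrete, here is a minimal toy sketch (my own construction, not from the paper): a binary latent z selects one of two token distributions, and exact next-token prediction without access to z forces the model to compute the posterior p(z | prefix), the "educated guess" above, via Bayes' rule. An autoregressive network has to carry an approximation of exactly this computation in its weights.

```python
# Hypothetical toy setup: a binary latent z picks one of two token
# distributions; predicting the next token without z requires inferring z.
import math

p_token_given_z = {
    0: {"a": 0.9, "b": 0.1},  # z = 0 mostly emits "a"
    1: {"a": 0.1, "b": 0.9},  # z = 1 mostly emits "b"
}
p_z = {0: 0.5, 1: 0.5}  # uniform prior over the latent

def posterior_z(prefix):
    """Exact posterior p(z | prefix) by Bayes' rule."""
    joint = {z: p_z[z] * math.prod(p_token_given_z[z][t] for t in prefix)
             for z in p_z}
    total = sum(joint.values())
    return {z: v / total for z, v in joint.items()}

def next_token_prob(prefix, token):
    """Autoregressive prediction must marginalize over the latent:
    p(x_t | x_<t) = sum_z p(z | x_<t) * p(x_t | z)."""
    post = posterior_z(prefix)
    return sum(post[z] * p_token_given_z[z][token] for z in p_z)

print(posterior_z(["a", "a", "b"]))           # ≈ {0: 0.9, 1: 0.1}
print(next_token_prob(["a", "a", "b"], "a"))  # ≈ 0.82
```

Conditioned on z, prediction is a table lookup; it is the implicit posterior computation that costs the extra capacity.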

Quote

It requires an unnecessarily complicated computation, and greater capacity, to implicitly make post-hoc decisions or infer latent quantities from the generated tokens.

Aside from the complexity, autoregressive modeling can also be led astray: if a few generated tokens are way off the mark, it becomes hard to draw the proper inferences about the latent variable, and thus to generate the correct next token.

Quote

It may be sent off track during the process if, by mistake, a few tokens generated are erroneous, ambiguous or contradictory with those generated previously.

The authors propose using what is in effect a VAE to learn a “good” distribution q(Z | X) for a latent variable Z, the goal of which is to sample latents that are useful in structuring the generative process for X, the training sample. We can’t really know in advance what sorts of Zs will be helpful or which the model will choose to learn during training.
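As a schematic of the objective (my notation and simplifications, not the paper's exact architecture; in particular I assume Z is a vector of independent Bernoullis with a uniform prior, and I ignore where the latent is injected into the decoder): the encoder produces q(Z | X) from the whole sequence, the decoder predicts X autoregressively given a sampled Z, and the loss is the negative ELBO, i.e. next-token cross-entropy plus a KL penalty pulling q(Z | X) toward the prior.

```python
# Schematic negative-ELBO loss for a latent-conditioned decoder (a sketch
# under the assumptions stated above, not the paper's exact formulation).
import torch
import torch.nn.functional as F

def neg_elbo(logits, targets, q_logits, kl_weight=1.0):
    # logits:   (batch, seq, vocab) next-token predictions from p(X | Z)
    # targets:  (batch, seq) ground-truth token ids
    # q_logits: (batch, latent_dim) logits of q(Z | X); Z is assumed to be
    #           independent Bernoullis with a uniform Bernoulli(0.5) prior
    rec = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    # KL(q(Z|X) || p(Z)) for Bernoulli q against the uniform prior.
    q = torch.sigmoid(q_logits)
    kl = (q * torch.log(q / 0.5 + 1e-8)
          + (1 - q) * torch.log((1 - q) / 0.5 + 1e-8)).sum(-1).mean()

    return rec + kl_weight * kl
```

Sampling a discrete Z inside the decoder would additionally need a gradient estimator (straight-through or a Gumbel relaxation, say); the sketch only covers the loss once the decoder's logits are in hand.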


References

https://www.youtube.com/watch?v=Nao16-6l6dQ
