Latent Diffusion Models

Diffusion models rival and can even surpass GANs at image synthesis: they cover the data distribution better, which yields more diverse outputs, and they do not suffer from the mode collapse and training instabilities that affect GANs.

They model the semantic structure of the data well and exhibit an impressive level of fine-grained detail; however, their iterative sampling procedure makes them slower than GANs.

Latent Diffusion Models (LDMs), introduced in [1], are among the most robust models for image synthesis, generating images that are perceptually on par with GANs or better, while combining the semantic power of Transformers with the high level of detail of diffusion models. They achieve this by running the diffusion process in a learned latent space rather than in pixel space, and by injecting semantic conditioning through a Transformer-based cross-attention mechanism.
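The core idea can be sketched numerically. Below is a minimal, self-contained sketch of the forward (noising) diffusion process applied in a latent space: the "encoder" is a hypothetical fixed random projection standing in for the pretrained autoencoder of [1] (which in reality is a learned KL- or VQ-regularized model), and the schedule values are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "encoder": a fixed random projection standing in for the
# pretrained autoencoder E that maps an image x to a compact latent z = E(x).
def encode(x, W):
    return x @ W  # (H*W,) -> (d_latent,)

# Linear noise schedule over T steps (illustrative values).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(z0, t, noise):
    """Forward process: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# Demo: diffusion happens on a 16-dim latent, not the 4096-dim "image".
x = rng.normal(size=64 * 64)              # flattened toy image
W = rng.normal(size=(64 * 64, 16)) / 64.0
z0 = encode(x, W)                         # 16 dims instead of 4096
eps = rng.normal(size=z0.shape)
z_early = q_sample(z0, 10, eps)           # mostly signal
z_late = q_sample(z0, T - 1, eps)         # almost pure noise: alpha_bar[T-1] ~ 0
```

Because the latent is far smaller than the pixel grid, every denoising step of the reverse process operates on a much cheaper representation, which is where the efficiency of LDMs comes from.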

Diffusion models are likelihood-based models, and their learning in pixel space can be divided into two stages: perceptual compression and semantic compression, see Fig. 1.

Fig. 1  Illustration of perceptual and semantic compression [1]

Perceptual compression is the stage in which high-frequency details are removed while little semantic variation is learned. Semantic compression is the stage in which the model actually learns the semantic and conceptual structure of the data.

Latent Diffusion Models introduce a space that is perceptually equivalent to pixel space but computationally more efficient, and train the diffusion model there. Most bits in an image do not encode essential details, so LDMs first compress away this imperceptible information and then generate images efficiently in the resulting latent space. LDMs can be trained for many tasks; one example is image super-resolution, implemented by directly conditioning the model on the low-resolution input.

Fig. 2  Example of an LDM trained for super-resolution; left: low-resolution input, right: generated image [1]
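For super-resolution, the conditioning can be as simple as concatenating the (upsampled) low-resolution image with the noisy latent before it enters the denoiser. The sketch below illustrates only this concatenation step; the one-layer linear "denoiser", the shapes, and the nearest-neighbour upsampler are hypothetical stand-ins, not the UNet or exact configuration of [1].

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_upsample(y, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return y.repeat(factor, axis=1).repeat(factor, axis=2)

C_z, H, W = 4, 16, 16                    # illustrative latent shape
z_t = rng.normal(size=(C_z, H, W))       # noisy latent at some step t
y = rng.normal(size=(3, 4, 4))           # low-resolution RGB conditioning image
y_up = nearest_upsample(y, 4)            # (3, 16, 16): matches latent spatial size

# Channel-wise concatenation: the denoiser input has C_z + 3 channels.
denoiser_in = np.concatenate([z_t, y_up], axis=0)

# Hypothetical linear "denoiser" mapping the conditioned input to a noise
# prediction with the latent's channel count.
W_d = rng.normal(size=(C_z, C_z + 3)) * 0.01
eps_pred = np.einsum('oc,chw->ohw', W_d, denoiser_in)
```

The design choice is that the low-resolution image acts as extra input channels seen at every denoising step, so the model learns to generate a latent whose decoded image is consistent with it.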

Gabriela Ghimpeteanu, Coronis Computing.


[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser and Björn Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models”, arXiv:2112.10752, 2021.