How Stable Diffusion works
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at this https URL . https://arxiv.org/abs/2112.10752
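The core of the "sequential application of denoising autoencoders" is a sampling loop: start from pure Gaussian noise and repeatedly subtract the noise that a trained network predicts, one step at a time. Below is a minimal DDPM-style sketch in PyTorch; the tiny denoiser and the linear beta schedule are illustrative stand-ins, not the paper's actual U-Net or noise schedule.

```python
import torch

class TinyDenoiser(torch.nn.Module):
    """Stand-in for the noise-prediction network eps_theta(z_t, t) (illustrative only)."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, z, t):
        # A real denoiser also conditions on the timestep t; ignored here for brevity.
        return self.net(z)

def sample(denoiser, shape=(1, 4, 64, 64), steps=50):
    # Simple linear beta schedule (assumption, not the paper's exact schedule).
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(z, t)                                  # predict the noise component
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])          # remove the predicted noise
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)  # re-inject scheduled noise
    return z  # in latent diffusion, a decoder then maps this tensor back to pixels

if __name__ == "__main__":
    print(sample(TinyDenoiser()).shape)  # torch.Size([1, 4, 64, 64])
```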
- A diffusion model decomposes the image formation process into a sequential application of denoising autoencoders
- A guiding mechanism can be implemented to control the image generation process without retraining
  - i.e., controlling the generated image with text prompts
- Doing this directly in pixel space is expensive
  - So we do it in the latent vector space instead (see the latent-space sketch after this list)
- Introducing a cross-attention layer into the model architecture makes the diffusion model a powerful and flexible generator for common conditioning inputs such as text and bounding boxes, and enables high-resolution synthesis in a convolutional fashion (see the cross-attention sketch after this list).
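On the "do it in the latent vector space" point: a pretrained autoencoder compresses the image into a much smaller latent, the entire denoising loop runs on that latent, and a decoder maps the result back to pixels only once at the end. The sketch below uses toy stand-in encoder/decoder modules (the real model is a KL-regularized or VQ autoencoder, e.g. with downsampling factor f = 8), so the shapes are illustrative.

```python
import torch
from torch import nn

# Toy stand-ins for the pretrained autoencoder; real latent diffusion compresses
# e.g. a 512x512x3 image into a 64x64x4 latent (8x spatial downsampling).
class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 4, kernel_size=8, stride=8)  # 8x downsampling

    def forward(self, x):
        return self.net(x)

class ToyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # back to pixel size

    def forward(self, z):
        return self.net(z)

encoder, decoder = ToyEncoder(), ToyDecoder()

x = torch.randn(1, 3, 512, 512)   # an image in pixel space
z = encoder(x)                    # compressed latent: [1, 4, 64, 64]
# ... the expensive denoising loop operates only on z ...
x_hat = decoder(z)                # decode the denoised latent back to pixels

print(z.shape, x_hat.shape)       # [1, 4, 64, 64] and [1, 3, 512, 512]
```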
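And on the cross-attention bullet: inside the denoising U-Net, queries are computed from the image's latent feature map while keys and values come from the conditioning sequence (e.g. text-encoder token embeddings), so the condition can steer every denoising step. A minimal single-head sketch follows; the dimensions (320-d features, 77 text tokens of 768 dimensions) are just plausible example values, not a faithful reproduction of the model.

```python
import torch
from torch import nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image latent features attend to text tokens."""
    def __init__(self, latent_dim=320, context_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim, bias=False)   # queries from image features
        self.to_k = nn.Linear(context_dim, latent_dim, bias=False)  # keys from text embeddings
        self.to_v = nn.Linear(context_dim, latent_dim, bias=False)  # values from text embeddings

    def forward(self, latent_tokens, text_tokens):
        q = self.to_q(latent_tokens)   # [B, N_pixels, D]
        k = self.to_k(text_tokens)     # [B, N_words, D]
        v = self.to_v(text_tokens)     # [B, N_words, D]
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                # each spatial position mixes in text information

latent_tokens = torch.randn(1, 64 * 64, 320)  # flattened U-Net feature map
text_tokens = torch.randn(1, 77, 768)         # e.g. a text encoder's token embeddings
out = CrossAttention()(latent_tokens, text_tokens)
print(out.shape)  # torch.Size([1, 4096, 320])
```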
This page is auto-translated from [/nishio/High-Resolution Image Synthesis with Latent Diffusion Models](https://scrapbox.io/nishio/High-Resolution Image Synthesis with Latent Diffusion Models) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.