# Stable Diffusion Interpretation: Review of Early Work

In this wave of AI painting, Stable Diffusion is undoubtedly the most popular image generation model.

There are several reasons behind this. First, Stable Diffusion greatly improves the efficiency of diffusion models by compressing the image size, so that ordinary users can run the model on their own consumer graphics cards. Second, many applications are built on Stable Diffusion: text-to-image and image completion come with Stable Diffusion itself, and there are many plug-in applications such as ControlNet, LoRA, and DreamBooth. Third, thanks to the first two reasons, Stable Diffusion has formed a huge user community in which users share models and insights.

Besides the general public, Stable Diffusion has also attracted a large number of researchers. However, many people run into difficulties when learning Stable Diffusion: the model involves many formulas, a VAE, a U-Net, and more. Where should one start?

In this article, written for readers who have never been exposed to image generation, we will explain from the beginning how Stable Diffusion grew step by step out of earlier work.

## Table of Contents

- Start with autoencoders
- The first route: VAE & DDPM
- The second route: VQVAE
- The intersection of routes: Stable Diffusion
- Summary
- Related Papers

## Start with autoencoders

Many image generation models, including Stable Diffusion, can be seen as improved versions of a very simple model: the autoencoder. Describing how Stable Diffusion gradually came about is really describing the step-by-step evolution of autoencoders.

Image compression methods such as PNG and JPG are very mature, but one may still wonder: are there better image compression algorithms?

Image compression is really about finding two mappings: one encodes the image into compressed data, and the other decodes the compressed data back into an image. We know that neural networks can, in theory, fit any mapping. So we can simply use two neural networks to fit the two mappings and obtain an image compression algorithm.

The neural network responsible for encoding is called **Encoder**, and the neural network responsible for decoding is called **Decoder**. We also need to set a learning goal for the two neural networks.

The two networks should satisfy an obvious constraint: the image reconstructed after encoding and decoding should be as consistent as possible with the original image, that is, the mean square error between the two should be as small as possible.

With this objective, training only needs arbitrary images: we pass an image through the encoder and decoder to get the reconstructed image, and compare the two. The images do not need labels, so the entire training process is self-supervised. This is why the whole model is called an **autoencoder (AE)**.
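The training objective described here can be sketched with a deliberately tiny example. This is only an illustration under stated assumptions, not a realistic model: the two-pixel "images", the single-number code, the linear encoder and decoder, and all names and hyperparameters (`encode`, `decode`, `train`, the learning rate) are made up for this sketch.

```python
# Toy "images": two pixels each, all lying on the line (t, 2t),
# so a single number is enough to describe any of them.
IMAGES = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(image, enc):
    """Encoder: compress a 2-pixel image into one number."""
    return enc[0] * image[0] + enc[1] * image[1]


def decode(code, dec):
    """Decoder: reconstruct a 2-pixel image from one number."""
    return (dec[0] * code, dec[1] * code)


def mse(a, b):
    """Mean squared error between two images."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def dataset_loss(enc, dec):
    """Average reconstruction error over the training images."""
    return sum(mse(decode(encode(x, enc), dec), x) for x in IMAGES) / len(IMAGES)


def train(steps=500, lr=0.05):
    """Fit encoder and decoder weights by gradient descent on the MSE."""
    enc, dec = [0.1, 0.1], [0.1, 0.1]
    for _ in range(steps):
        grad_enc, grad_dec = [0.0, 0.0], [0.0, 0.0]
        for x in IMAGES:  # no labels: the target is the input itself
            code = encode(x, enc)
            x_hat = decode(code, dec)
            err = [x_hat[i] - x[i] for i in range(2)]    # d(mse)/d(x_hat)
            d_code = err[0] * dec[0] + err[1] * dec[1]   # d(mse)/d(code)
            for i in range(2):
                grad_dec[i] += err[i] * code / len(IMAGES)
                grad_enc[i] += d_code * x[i] / len(IMAGES)
        enc = [w - lr * g for w, g in zip(enc, grad_enc)]
        dec = [w - lr * g for w, g in zip(dec, grad_dec)]
    return enc, dec


initial_loss = dataset_loss([0.1, 0.1], [0.1, 0.1])
enc, dec = train()
final_loss = dataset_loss(enc, dec)
print(f"reconstruction error: {initial_loss:.4f} -> {final_loss:.6f}")
```

Note that the training loop never uses a label: the only supervision signal is the input image itself, which is what makes the process self-supervised.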

Why is the image compression model AE related to image generation? Try covering up the input image and the encoder of an AE, and look only at the decoding part. Seen from this angle, decoding compressed data into an image is the same as generating an image from some data.

Unfortunately, an AE is not a qualified image generation model. By image generation we usually mean letting the program generate a variety of pictures. To get different pictures, we usually let the program generate pictures from random numbers (or random vectors). An ordinary AE overfits: its decoder only recognizes the compressed data that the encoder produces for pictures in the training set, not randomly generated compressed data, so it cannot meet the requirements of image generation.

Overfitting means that the model can only handle the training data and does not generalize to other data. To take an extreme example, as shown in the figure below, the encoder and decoder could directly memorize the entire data set and compress each picture into a single number. That is, the model treats the encoder as a dictionary from pictures to numbers, and the decoder as a dictionary from numbers to pictures. This way, no matter how big the data set is, every image can be compressed into a single number. Such an AE does have strong compression capabilities, but it is completely useless because it is overfitted and cannot handle data outside the training set.
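The extreme "dictionary" case can be written down literally. The class below is a hypothetical illustration (the name `MemorizingAutoencoder` and the tiny tuples standing in for images are assumptions): it reconstructs every training image perfectly, yet its decoder has nothing to say about a code it has never seen.

```python
class MemorizingAutoencoder:
    """An 'autoencoder' that just memorizes its training set."""

    def __init__(self, training_images):
        # Encoder: dictionary from picture to number.
        self.image_to_code = {image: i for i, image in enumerate(training_images)}
        # Decoder: dictionary from number back to picture.
        self.code_to_image = {i: image for image, i in self.image_to_code.items()}

    def encode(self, image):
        return self.image_to_code[image]

    def decode(self, code):
        return self.code_to_image[code]


train_set = [(0, 0, 255), (255, 255, 0), (12, 34, 56)]  # tiny stand-in "images"
ae = MemorizingAutoencoder(train_set)

# Perfect reconstruction on the training set...
assert all(ae.decode(ae.encode(image)) == image for image in train_set)

# ...but a random code, which is what a generator would feed in, means nothing.
try:
    ae.decode(12345)
except KeyError:
    print("decoder cannot handle a code it has never seen")
```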

Overfitting is inevitable in the ordinary AE. In order to generate images with the decoder of an AE, many works have tried to overcome this overfitting. There are many ideas for improving AE; in this article, we roughly divide them into two routes: solving the overfitting problem so that the AE can be used directly for image generation, and using the AE to compress images so that image generation is achieved indirectly.

## The first route: VAE & DDPM

In the first improvement route, many subsequent works tried to use more advanced mathematical models to solve the overfitting problem of AE. The **Variational Autoencoder (VAE)** is a representative example.

VAE makes several changes to AE. First, the output of the VAE encoder is no longer a fixed value but a random value drawn from a normal distribution. More specifically, during training the encoder outputs a mean and a variance at the same time. The model then randomly samples a value from the normal distribution described by this mean and variance and uses it as the input of the decoder. Intuitively, compared with AE, the encoder outputs an extra variance, which adds a small random perturbation to the output of the original AE encoder.

This change alleviates overfitting. Why? We can think of it this way: the original AE overfits because it rigidly memorizes the encoding of every item in the training set. Now the VAE encoder no longer outputs a fixed value, but a random value near the mean. As a result, the VAE cannot memorize by rote and must find patterns in the data.
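The sampling step can be sketched as follows. This is a minimal illustration, assuming the encoder has already produced a per-dimension mean and variance; the function name `sample_latent` and the example numbers are hypothetical. Writing the sample as mean plus standard deviation times standard normal noise is one common way to express it.

```python
import math
import random


def sample_latent(mean, variance, rng):
    """Draw one decoder input from N(mean, variance), element by element."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]


rng = random.Random(0)
mean, variance = [0.5, -1.0], [0.04, 0.04]  # pretend encoder output for one image

# The same image now yields a slightly different code every time it is encoded.
first = sample_latent(mean, variance, rng)
second = sample_latent(mean, variance, rng)
print(first, second)

# With zero variance the VAE encoder falls back to the plain AE behaviour.
assert sample_latent(mean, [0.0, 0.0], rng) == mean
```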

The second change to VAE is to add an additional learning objective to make the output of the encoder as similar as possible to the standard normal distribution. As we talked about earlier, image generation models generally generate images based on a random vector. The most common way to generate random vectors is to sample from the standard normal distribution. In other words, when using VAE to generate images, we will throw away the encoder and use the process shown in the figure below to generate images. If we do not constrain the output distribution of the encoder so that it outputs a distribution that is very similar to the standard normal distribution, the decoder will not be able to generate images well based on random vectors from the standard normal distribution.

In summary, VAE makes two improvements to AE: the encoder outputs a normal distribution, and this distribution should be as similar as possible to the standard normal distribution. During training, the model randomly samples a value from the distribution output by the encoder as the input of the decoder; during image generation (sampling), the model randomly samples a value from the standard normal distribution as the input of the decoder. Accordingly, the error function of VAE consists of two parts: the reconstruction error between the original image and the reconstructed image, and the error between the encoder output and the standard normal distribution. Put differently, VAE should minimize the reconstruction error and maximize the similarity between the encoder output and the standard normal distribution, so its error function can be written as the reconstruction error minus this similarity.

The error between distributions can be measured by a quantity called KL divergence. In the VAE error function, the negative-similarity term is therefore the KL divergence between the encoder output distribution and the standard normal distribution. These two changes essentially address the overfitting problem of AE, so they can be seen as a regularization method. We refer to the regularization method of VAE as KL regularization. (In machine learning, "regularization method" is simply shorthand for "method that reduces model overfitting".)
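For a normal distribution with independent dimensions, the KL divergence to the standard normal distribution has a well-known closed form, so the two-part error function can be sketched in a few lines. This is an illustration under assumptions rather than the exact loss of any particular implementation: the function names and the `kl_weight` parameter are hypothetical, and the reconstruction term is the plain mean square error discussed above.

```python
import math


def kl_to_standard_normal(mean, variance):
    """KL(N(mean, variance) || N(0, 1)), summed over independent dimensions."""
    return 0.5 * sum(m * m + v - 1.0 - math.log(v) for m, v in zip(mean, variance))


def vae_loss(original, reconstructed, mean, variance, kl_weight=1.0):
    """Reconstruction error plus the KL regularization term."""
    reconstruction = sum(
        (a - b) ** 2 for a, b in zip(original, reconstructed)
    ) / len(original)
    return reconstruction + kl_weight * kl_to_standard_normal(mean, variance)


# The KL term vanishes only when the encoder outputs exactly N(0, 1)...
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))
# ...and grows as the mean drifts away or the variance shrinks.
print(kl_to_standard_normal([2.0, 0.0], [0.1, 1.0]))
```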

VAE can indeed alleviate the overfitting of AE. However, since VAE only makes the mean square error (reconstruction error) between the reconstructed image and the original image as small as possible without imposing more constraints on the quality of the reconstructed image, the reconstruction results and image generation results of VAE are very blurry. The following are the image generation results of VAE on the CelebA dataset.

Among the many improvements to VAE, an image generation model called the Denoising Diffusion Probabilistic Model (DDPM) stands out. DDPM is the originator of today's diffusion models. Let's take a look at how DDPM models image generation based on VAE.

VAE probably performs poorly because it has too few constraints. Both the encoder and the decoder of VAE are neural networks, and a neural network is a black box: we cannot impose constraints on its intermediate steps, only on the output of the encoder (a normal distribution) and the output of the decoder (the reconstructed image). Can the encoding and decoding process of VAE be made more controllable?

The design of DDPM is inspired by thermodynamics: one distribution can gradually transform into another through a series of simple changes (such as adding Gaussian noise). Conveniently, the VAE encoder is meant to turn the images of the training set (the training set distribution) into a standard normal distribution. So instead of using a learnable neural network as the encoder, DDPM simply uses a sequence of predefined noise-adding operations to represent the encoding process. It can be proven mathematically that after many noise-adding operations, the final image distribution is a standard normal distribution.

Since encoding adds noise, decoding should remove noise. The DDPM decoder is no longer an uninterpretable black-box neural network, but a neural network that predicts the results of a series of denoising steps.

Compared with VAE, which has only two constraints, DDPM has many more. In DDPM, the t-th denoising operation should cancel out the t-th noise-adding operation as closely as possible.

Let’s take a more specific look at the learning objective of DDPM. Adding noise means sampling from a normal distribution whose mean is approximately equal to the current image. For example, to add noise to an image, we can sample a new image from a normal distribution whose mean is about 0.9 times the original image: each pixel of the new image has a mean of about 0.9 times its original value, and the new image contains a lot of noise. Call this the noise-adding distribution of step t. After some mathematical derivation, we can find the inverse of this operation, and this inverse of adding noise is also a normal distribution. We can therefore assume that each denoising step is also a normal distribution, and train the model so that the denoising operation of step t is as similar as possible to the inverse of the noise-adding operation of step t.
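The noise-adding process can be simulated directly. The sketch below keeps the 0.9 mean scaling from the example in the text; the noise variance of 1 - 0.9² is an assumption, chosen so that the total variance tends to 1, and a real DDPM uses a schedule in which these values change from step to step. The names `add_noise` and `distribution_after` are hypothetical.

```python
import math
import random

KEEP = 0.9                   # each step scales the image mean by 0.9 (from the text)
NOISE_VAR = 1.0 - KEEP ** 2  # assumed noise variance, so total variance tends to 1


def add_noise(image, rng):
    """One noise-adding step: sample from N(KEEP * image, NOISE_VAR)."""
    std = math.sqrt(NOISE_VAR)
    return [KEEP * pixel + std * rng.gauss(0.0, 1.0) for pixel in image]


def distribution_after(steps):
    """Mean scale and variance of a pixel after `steps` steps, given the clean pixel."""
    mean_scale, variance = 1.0, 0.0
    for _ in range(steps):
        mean_scale = KEEP * mean_scale
        variance = KEEP ** 2 * variance + NOISE_VAR
    return mean_scale, variance


rng = random.Random(0)
image = [1.0, -1.0, 0.5]  # a stand-in "image" with three pixels
for _ in range(100):
    image = add_noise(image, rng)

mean_scale, variance = distribution_after(100)
print(f"after 100 steps: mean scale {mean_scale:.2e}, variance {variance:.6f}")
```

After enough steps the original pixel value contributes almost nothing to the mean and the variance is almost exactly 1, which is the sense in which the encoded image ends up following a standard normal distribution.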

To summarize, DDPM has made the following changes to VAE:

- The encoder is a series of non-learnable (fixed) noise-adding operations.
- The decoder is a learnable sequence of denoising operations.
- The image size remains the same throughout.

Compared with VAE, the encoding and decoding processes of DDPM are more clearly defined, so more constraints can be imposed on them. Therefore, as shown in the figure below, its generation quality is much better than that of VAE. At the same time, like VAE, DDPM samples from a distribution when encoding instead of outputting a fixed value, so it does not suffer from the overfitting problem of AE.

The generation effect of DDPM is really good. However, since DDPM always operates on data of the same size, the size of the image greatly affects the running speed of DDPM. Generating high-resolution images with DDPM requires a large amount of computing resources. Therefore, if you want to use DDPM to generate high-quality images, you have to go through another route.

## The second route: VQVAE

In the second improvement route of AE, some works simply gave up using the AE itself for image generation, and instead used its image compression capability to split image generation into two steps: first, use the AE encoder to compress images into smaller images; then, use another image generation model to generate small images, and use the AE decoder to reconstruct the small images into real images.

Why is there such a strange image generation method? It starts with another type of image generation model. After the Transformer, originally a machine translation model, was introduced, many works tried to use it for image generation. However, a Transformer designed to generate text cannot be applied to images directly. In natural language processing (NLP), a sentence is represented by several words, and each word is represented by an integer. So when a Transformer generates a sentence, it is actually generating several discrete integers, that is, a discrete vector. In an image generation model, by contrast, the color value of each pixel is a continuous floating point number. To use a Transformer directly for image generation, we have to find a way to represent the image as a discrete vector. We know that an AE can encode an image into a continuous vector. Can we modify the AE so that it encodes the image into a discrete vector?

Vector Quantised-Variational AutoEncoder (VQVAE) is an AE that can encode images into discrete vectors (even though its name contains "VAE"). Let's take a brief look at how VQVAE encodes images into discrete vectors.

Suppose we already have an AE whose encoder outputs discrete vectors.

Since neural networks cannot handle discrete data well, we need to introduce the common practice in NLP and add an embedding layer that maps discrete vectors into continuous vectors.

Now let's go back to the question of how to make the encoder output a discrete vector. We can keep the AE encoder unchanged, so that it still outputs a continuous vector, and then apply a "vector discretization" operation that turns the continuous vector into a discrete one. This operation aligns the output of the encoder to the nearest vector of the embedding layer. Its principle is similar to discretizing 0.99 and 1.01 into 1, except that it treats the vector as a whole instead of considering each number individually. We will not study the specific principles of the vector discretization operation in detail here.
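The core of the discretization step is a nearest-neighbour lookup in the embedding layer, which can be sketched as follows. This only illustrates the lookup itself, not how VQVAE trains through it; the function name `quantize` and the three-entry `codebook` are hypothetical.

```python
def quantize(vector, codebook):
    """Snap a continuous vector to the closest embedding vector.

    Returns the index of that embedding (the discrete code) and the
    embedding itself (what the decoder actually receives).
    """
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]  # hypothetical embedding layer
index, embedding = quantize((0.99, 1.01), codebook)
print(index, embedding)
```

The vector (0.99, 1.01) is compared with each embedding as a whole, so it maps to the entry (1.0, 1.0) rather than each coordinate being rounded separately.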

Ignoring the implementation details, we can think that VQVAE can compress images into discrete vectors. More precisely, VQVAE can compress images into discrete "small images" in equal proportions. Compressing into a two-dimensional image instead of a one-dimensional vector can retain some spatial characteristics of the original image, paving the way for the second step of image generation.

To summarize, VQVAE is an AE that can compress images into discrete small images. In order to generate images with VQVAE, a two-stage image generation process needs to be performed:

- During training, first train an image compression model (VQVAE), and then train a model that generates compressed images (such as Transformer).
- When generating, first use the second model to generate a compressed image, and then use the decoder of the first model to restore the compressed image to a real image.

There are two reasons why we perform a two-stage image generation process instead of just using a second model to generate large images.

The first reason, as mentioned earlier, is that generative models such as the Transformer only support generating discrete images, so another model is needed to convert continuous color values into discrete values that these models can work with.

The second reason is to reduce the computational cost of the model. Take the Transformer as an example: its number of operations is roughly proportional to the square of the number of pixels, so the overhead of using a Transformer to generate high-resolution images directly is unacceptable. If an AE compresses the image first, however, the Transformer only has to generate a much smaller image and becomes affordable.
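A quick back-of-the-envelope calculation shows how much the compression helps. The sketch assumes a cost exactly proportional to the square of the number of positions, a 256x256 image, and a 16x reduction of the side length; these numbers and the function name `attention_cost` are illustrative assumptions.

```python
def attention_cost(height, width, compression=1):
    """Relative cost of a model whose work grows with (number of positions)^2."""
    positions = (height // compression) * (width // compression)
    return positions ** 2


full = attention_cost(256, 256)
compressed = attention_cost(256, 256, compression=16)
print(f"cost ratio: {full // compressed}x")
```

Shrinking the side length by a factor of 16 shrinks the number of positions by 16², and the quadratic cost by 16⁴.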

VQVAE brought three inspirations to subsequent work. First, an AE can be used to compress images into discrete vectors. Second, if an image generation model is too expensive to generate high-resolution images directly, an AE can be used to compress the images first, and the model then generates compressed images. These two inspirations correspond to the two reasons for using VQVAE given above.

The third inspiration is more interesting. While discussing VQVAE, we did not consider overfitting at all. This is because after the vector discretization operation, the input of the decoder is no longer the raw output of the encoder, but a vector from the embedding layer. This removes the rote memorization of AE and alleviates overfitting. We can therefore look at VQVAE from another angle: the encoder is still an AE encoder whose output is a continuous vector, and the subsequent vector discretization operation and embedding layer are all part of the decoder. From this perspective, VQVAE actually proposes a regularization module consisting of vector discretization and an embedding layer. Like the KL divergence constraint of VAE, this module addresses the overfitting problem of AE. We call the regularization method of VQVAE VQ regularization.

The image generation method proposed in the VQVAE paper performs only moderately well. Like ordinary AE, VQVAE only uses the reconstruction error to constrain image quality during training, so the details of the reconstructed images are still very blurry. Moreover, the second-stage image generation model paired with VQVAE is not the more powerful Transformer, but a CNN-based image generation model.

The subsequent VQGAN paper improved on VQVAE. For the first-stage image compression model, VQGAN adds some supervision losses from generative adversarial networks (GANs) on top of VQVAE to improve reconstruction quality; for the second-stage image generation model, the method uses a Transformer. With these changes, the VQGAN method is able to generate high-quality, high-resolution images. Moreover, by feeding additional constraints (such as semantic segmentation maps or text) into the Transformer, the VQGAN method can perform constrained image generation. The following are high-definition images generated by the VQGAN method from semantic segmentation maps.

## The intersection of routes: Stable Diffusion

After reading the above two AE improvement routes, I believe you can already guess the core idea of Stable Diffusion. Let’s see how Stable Diffusion draws inspiration from these two paths.

After releasing VQGAN, Germany's CompVis Lab began to explore ways to improve it. VQGAN compresses the image side length by a factor of 16, while the Transformer paired with VQGAN can only generate 16x16 compressed images at once. In other words, the whole method can only generate 256x256 images in one pass. To generate higher-resolution images, the VQGAN method has to use sliding windows. Can a model generate higher-resolution images in one pass? The main factor limiting the resolution of the VQGAN method is the Transformer. If the Transformer could be replaced with a more efficient model that can generate higher-resolution images, wouldn't it be possible to generate larger images? CompVis Lab therefore turned its attention to DDPM.

Therefore, one year after the release of VQGAN, CompVis Lab released the paper High-Resolution Image Synthesis with Latent Diffusion Models, which proposes an image generation model called the latent diffusion model (LDM). By cooperating with the AI company Stability AI and using its huge computing resources to train LDM, CompVis Lab released an open-source text-to-image AI painting model, commercially known as Stable Diffusion.

LDM is essentially the VQGAN method with the image generation model changed from a Transformer to DDPM. From another perspective, in order to let DDPM generate high-resolution images, LDM takes advantage of the second inspiration of VQVAE: first use an AE to compress the image, and then use DDPM to generate the compressed image. The AE of LDM generally compresses the image side length by a factor of 8, so the side length of the images the whole LDM can generate is 8 times the side length of the compressed images DDPM generates.
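The relation between the two image sizes is simple arithmetic. The sketch below uses the 8x side-length compression mentioned above; the 512x512 example size and the helper names `compressed_side` and `generated_side` are assumptions chosen for illustration.

```python
def compressed_side(image_side, factor=8):
    """Side length of the compressed image that DDPM works on."""
    if image_side % factor:
        raise ValueError("image side must be a multiple of the compression factor")
    return image_side // factor


def generated_side(latent_side, factor=8):
    """Side length of the final image decoded from a compressed image."""
    return latent_side * factor


side = compressed_side(512)
print(f"DDPM works on {side}x{side}, the AE decoder outputs "
      f"{generated_side(side)}x{generated_side(side)}")
print(f"positions DDPM has to process: {side * side} instead of {512 * 512}")
```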

Unlike the Transformer, DDPM processes images represented by continuous vectors. Therefore, when VQGAN is used for image compression in LDM, the vector discretization operation is not strictly required; it is enough to add a slight regularization on top of the AE. The authors discuss two types of regularization when implementing LDM: the KL regularization of VAE and the VQ regularization (corresponding to the third inspiration of VQVAE). Both achieve good results.

LDM still supports constrained image generation. With the Transformer replaced by DDPM, the additional constraints are fed into DDPM instead. The authors discuss several ways of feeding constraints into DDPM in the paper.

## Summary

Once the early work is understood, the core idea of Stable Diffusion is easy to grasp. Let us summarize the development and main structure of Stable Diffusion. Stable Diffusion grew out of two kinds of AE variants: DDPM, which has powerful generation capabilities but requires a lot of computing resources, and VQVAE, which can compress images with high fidelity. Stable Diffusion is a two-stage image generation model: it first uses a VQGAN with KL regularization or VQ regularization to compress images, and then uses DDPM to generate compressed images. Additional constraints (such as text) can be fed into DDPM to achieve constrained image generation.

## Related papers

This article only briefly summarizes the early work of Stable Diffusion. To fully understand Stable Diffusion, you need to read more early papers. Let me categorize the early papers according to their importance.

### Must-read articles on image generation

- Neural Discrete Representation Learning (VQVAE): https://arxiv.org/abs/1711.00937
- Taming Transformers for High-Resolution Image Synthesis (VQGAN): https://arxiv.org/abs/2012.09841
- Denoising Diffusion Probabilistic Models (DDPM): https://arxiv.org/abs/2006.11239

### Selected-reading articles on image generation

- Auto-Encoding Variational Bayes (VAE): https://arxiv.org/abs/1312.6114
- Pixel Recurrent Neural Networks (PixelCNN): https://arxiv.org/abs/1601.0675
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics: https://arxiv.org/abs/1503.03585
- Denoising Diffusion Implicit Models (DDIM): https://arxiv.org/abs/2010.02502
- Classifier-Free Diffusion Guidance: https://arxiv.org/abs/2207.12598
- Generative Adversarial Networks (GAN): https://arxiv.org/abs/1406.2661
- A Style-Based Generator Architecture for Generative Adversarial Networks (StyleGAN): https://arxiv.org/abs/1812.04948