Understand the basic concepts of Stable Diffusion in 10 minutes

As the saying goes, sharpening the axe does not delay the cutting of firewood: only with a sharp axe can you chop wood easily and gather the firewood you need. The same is true for everyday work. If we want to do something well, we must first lay a good foundation, identify the core concepts, and understand the relevant principles; only then can we apply them to get satisfying results.

This article will introduce you to some basic concepts of Stable Diffusion. The relevant instructions will be as easy to understand as possible while remaining correct, hoping to help confused students.

Table of Contents

  1. The Model
  2. Stable Diffusion
  3. Latent Space
  4. Training
  5. Vector
  6. Generation
  7. VAE Model
  8. Lora Model
  9. Prompt
  10. Negative Prompt
  11. CFG Scale
  12. Sampling Method
  13. Seed

The Model

Understanding the word 'model' requires only a little junior high school mathematics. You can think of each model as a mathematical function y = f(x): input a parameter x and get a return value y. It is just that this function may be very complex, containing many sub-functions connected through various rules.

When using Stable Diffusion to generate pictures, we encounter many models. The most basic is the Stable Diffusion large model, such as Anything or realisticVision. For such a model, we can simply think of its parameters as the prompt words, the image size, the prompt guidance scale, the random number seed, and so on, and its return value as the image data.
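
The "model as a function" idea can be sketched as a stub. This is purely illustrative: none of these names correspond to a real Stable Diffusion API, and the body is a placeholder.

```python
# Illustrative stub only: a "model" treated as a function y = f(x).
# The parameter and return names here are hypothetical, not a real API.
def model(prompt, size, cfg_scale, seed):
    width, height = size
    # A real model would run text encoding, diffusion, and VAE decoding here;
    # this stub just packages its inputs into a placeholder result.
    return {"image": f"<{width}x{height} picture of '{prompt}'>", "seed": seed}

result = model("a girl", (512, 512), cfg_scale=7, seed=42)
print(result["image"])
```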

In addition to large models, Lora models and embedded models are commonly used. A Lora model is a fine-tuned model used to generate pictures with particular characteristics, such as mecha or illustration styles. An embedded model uses a single token to represent a set of data, which saves some typing; a common example is the negative prompt embedding EasyNegative.

Other than the models supported by Stable Diffusion itself, many plug-ins also require models to function properly. For example, ADetailer requires downloading face and hand models.

Model files come with many suffixes. The most common for large Stable Diffusion models are .safetensors and .ckpt. .safetensors files are safer and faster to load than .ckpt files, so use them when available. Some models use suffixes such as .pt, .pth, or .onnx; these are generally produced by different machine learning frameworks or store information in different formats.

Stable Diffusion

Let’s first get to know these two words:

Diffusion originally refers to the transfer process of molecules from a high concentration area to a low concentration area. Stable Diffusion has two types of diffusion, forward diffusion and backward diffusion.

  • Forward diffusion continuously adds noise (typically Gaussian) to a clear image, making it blurrier and blurrier until it becomes a pure noise image. Spreading the image's feature data out more widely and evenly in this way increases the variation and diversity of the generated images and brings them closer to real images.
  • Backward diffusion describes the transformation from a noisy image back to a clear image. As the diffusion steps proceed, the noise is gradually removed and the image becomes clearer. Denoising here means removing content that should not appear in the image.
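
The forward process described above can be sketched in a few lines. This is a toy illustration of the DDPM-style closed form, with made-up array sizes standing in for an image; the function name and values are assumptions for illustration only.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    # DDPM-style closed form for jumping straight to noise level alpha_bar:
    # x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * gaussian_noise.
    # As alpha_bar approaches 0, x_t approaches pure Gaussian noise.
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))                     # stand-in for a clean image
slightly_noisy = forward_diffuse(x0, 0.9, rng)   # mostly signal
mostly_noise = forward_diffuse(x0, 0.01, rng)    # almost pure noise
```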

The word Stable refers to the stability of the results: regardless of the input text prompt, a trained model can reliably generate images that match the prompt, and the quality of these images is usually consistent, without obvious fluctuations.

Latent space

Stable Diffusion training and image generation operate not in the image pixel space but in a latent space, which is much smaller. For example, a 512*512 image occupies a 3*512*512 pixel tensor, while the corresponding Stable Diffusion latent is a 4*64*64 tensor (each side downsampled by 8), roughly 48 times smaller. The latent space therefore significantly reduces the memory and computing power required for training and generation.
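
The size difference is easy to check directly. The figures below assume the commonly documented Stable Diffusion v1 layout: a 4-channel latent with each side downsampled 8x.

```python
# Element counts for a 512x512 RGB image vs. the SD v1 latent
# (4 channels, 512/8 = 64 per side).
pixel_elems = 3 * 512 * 512    # RGB pixel tensor
latent_elems = 4 * 64 * 64     # latent tensor
print(pixel_elems // latent_elems)  # → 48
```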


Training

Training refers to the process of building a generative model from images paired with text descriptions.

  • The training program first converts the text description into a vector in the latent space through a text encoder. Each dimension of this vector represents a feature or attribute of the text, such as color, shape, or size. At the same time, the training program also encodes the original image into the latent space;
  • Noise is then added step by step to the image's latent representation using the "forward diffusion" method mentioned above, producing a noisy image;
  • The "backward diffusion" method mentioned above, guided by the text vector, then removes the noise from this noisy image and tries to restore the vector representation of the original image;
  • Finally, the restored representation is compared with the original image's vector representation to calculate a loss value, and the model's parameters are adjusted to minimize that loss.

This process is repeated many times until the loss can no longer be reduced, or can only be reduced by a tiny amount. At the same time, the vectors produced from the text descriptions gradually learn to approximate the information in the original images. The result is a relatively high-quality model.
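
The loop "compute loss, adjust parameters, repeat until the loss stops falling" can be shown with a deliberately tiny toy. Here the "denoiser" is a single scalar weight and the "latent" is a short random vector; this is a caricature of training, not how Stable Diffusion is actually implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

x0 = rng.normal(size=16)       # toy "clean latent"
noise = rng.normal(size=16)
x_noisy = x0 + noise           # forward diffusion: add noise

w = 0.0                        # the single "denoiser" parameter to learn
lr = 0.1
losses = []
for step in range(100):
    x_pred = w * x_noisy                        # toy "backward diffusion"
    loss = np.mean((x_pred - x0) ** 2)          # compare with the original
    grad = 2 * np.mean((x_pred - x0) * x_noisy) # gradient of the loss in w
    w -= lr * grad                              # adjust to reduce the loss
    losses.append(loss)
```

By the end of the loop the loss has dropped and barely changes between steps, which is exactly the stopping condition described above.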


Vector

A vector is a quantity with both magnitude and direction, such as acceleration. Vectors can also represent a set of values, such as the pixel values of an image or the word frequencies of a text. Vectors make it convenient to operate on and compute with data; for example, the angle between two text vectors can be calculated to measure their similarity.
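
The angle-based similarity mentioned above is usually computed as cosine similarity. A minimal sketch (the vectors here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); values near 1 mean "very similar"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, different magnitude
print(cosine_similarity(a, b))  # → 1.0 (identical direction)
```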


Generation

Generation refers to the process in which the model user generates pictures from text prompt words plus the model.

When generating, the text prompt words are again converted into a vector in the latent space. A noise image is then created in the latent space, its content determined by the random seed. The "backward diffusion" method, guided by the text vector, removes the noise from this noisy image step by step, and the result is finally decoded into a clear picture.

VAE Model

As mentioned above, Stable Diffusion does its processing in a latent space. Converting between image data and latent data is the job of the VAE model. During this conversion, the VAE can learn to handle certain image features better, giving generated images richer color and sharper detail; this helps fix gray, washed-out results.

Many large models come with their own VAE, in which case we do not need to add one. Some models do not include a VAE, and then you need to pick a matching one. In the Stable Diffusion WebUI you can simply choose "Automatic" and not think about it.

For the Stable Diffusion WebUI, the default directory for VAE models is <stable-diffusion-webui>\models\VAE. Please replace <stable-diffusion-webui> with your own installation directory. Note that this directory can be changed through startup parameters, so please confirm your own VAE directory.

Lora Model

This is a style model built on top of a large model, which means it cannot be used alone and must be paired with a large model. For example, when drawing a character, we can use Lora models to control the character's clothing and headgear; when generating mechanical limbs, we can use Lora models to enhance the mecha style covering the limbs; when drawing landscapes, we can use Lora models to control the painting style.

The training cost of this kind of model is not high, and it often only requires dozens of pictures to train, making it more convenient for ordinary users to try it out.

Hypernetwork models are similar in function and usage to Lora models, but their results are generally worse, and very few of them exist in the community.

For the Stable Diffusion WebUI, the default directory for Lora models is <stable-diffusion-webui>\models\Lora; it is empty initially. Please replace <stable-diffusion-webui> with your own installation directory. Note that this directory can be changed through startup parameters, so please confirm your own Lora directory.


Prompt

The prompt describes the picture you want to draw. For example, in the example above I used just: a girl. This prompt is too simple: although the AI draws a girl, it does not know what the girl in your mind looks like. If you want a girl that better matches your needs, you have to tell it more details, such as what color the girl's hair is, what clothes she is wearing, whether she is standing or sitting, and whether she is outdoors or indoors.

Prompts are particularly important in AI painting. A common writing pattern is style, subject, then details. Style is whether the picture is a photo, an ink painting, or a line drawing; the subject is what you want to draw, such as a girl, a tree, or a dog; the details describe what the subject looks like, such as long or short hair, black or pink hair, blue or yellow eyes, and so on.

In addition, we can use quality prompt words to improve the picture, such as masterpiece, best quality, ultra-detailed, 8K, etc. However, these have little effect on the newer SDXL models, which already strive for high quality on their own. You can also use weight numbers to strengthen or weaken a prompt word's influence on the generated picture, for example (red hair:1.31), which increases the probability that the hair in the output photo is red.
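
A simplified parser shows what a weighted token like (red hair:1.31) means. This is a sketch of the idea only; the real WebUI parser handles nesting, escaping, and other syntax that this toy ignores.

```python
import re

def parse_weighted_token(token):
    # Match WebUI-style "(words:weight)"; anything else gets weight 1.0.
    # A hypothetical, simplified helper, not the actual WebUI implementation.
    m = re.fullmatch(r"\((.+):([\d.]+)\)", token.strip())
    if m:
        return m.group(1), float(m.group(2))
    return token.strip(), 1.0

print(parse_weighted_token("(red hair:1.31)"))  # ('red hair', 1.31)
print(parse_weighted_token("masterpiece"))      # ('masterpiece', 1.0)
```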

Negative Prompt

The negative prompt lists things you do not want to appear in the picture, such as trees, tables, six fingers, or missing arms and legs. In the example above I used "EasyNegative", which is the name of an embedded model. It can be thought of as representing a set of common negative prompt words, so you do not need to type them in one by one, and it only costs as much as a single prompt word.

CFG Scale

In Stable Diffusion, the CFG Scale controls how strongly the text influences the denoising of the noise image. The smaller the value, the more freely the model generates the image; the larger the value, the more closely the image follows the text information.

Sampling Method

Sampling is all about removing noise. When introducing Stable Diffusion above, we mentioned the concept of "reverse diffusion". The sampler is an algorithm program used for denoising during reverse diffusion.

Stable Diffusion WebUI integrates many samplers by default, such as Euler a, DPM++, DDIM, and Restart. They differ somewhat in image quality and generation speed. For example:

  • Euler a: This is the one selected by default. It is the simplest and fastest sampler. It can produce great diversity with a small number of steps. Different steps may have different results. However, when the number of steps is too high (more than 30 steps), the quality of the image may not improve.
  • DPM++: performs well in terms of image quality, but compared to other samplers, its time consumption may increase accordingly.
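
The step-by-step denoising that all samplers share can be caricatured in a few lines. This toy removes a fixed fraction of the "estimated noise" each step; real samplers differ in how a trained model estimates the noise and how the steps are scheduled, and the names here are invented for illustration.

```python
import numpy as np

def toy_sampler(x_noisy, target, steps):
    # Each step removes part of the estimated noise; more steps means
    # the result lands closer to the clean target.
    x = x_noisy.copy()
    for _ in range(steps):
        noise_estimate = x - target       # a trained model would predict this
        x = x - noise_estimate / steps    # remove a fraction of the noise
    return x

rng = np.random.default_rng(0)
target = rng.normal(size=16)              # stand-in for the "clean" latent
x_noisy = target + rng.normal(size=16)    # noisy starting point
denoised = toy_sampler(x_noisy, target, steps=20)
```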


Seed

Random seeds are one of the charms of AI painting. With all other parameters identical, different random numbers produce slightly different pictures each time, so the creativity is almost endless. Conversely, the same random number reproduces the same picture, which is why many people never tire of collecting Stable Diffusion generation parameters. A seed of -1 means a different random number is used each time.

Inside Stable Diffusion, a random number seed affects the noisy image in the latent space. When other parameters are the same, the same random number seed will generate the same noise image and therefore the same pixel image.
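
The seed-determines-noise behavior is easy to demonstrate with any seeded random generator. This sketch uses numpy as a stand-in; actual Stable Diffusion implementations use their own framework's RNG, and the 4x64x64 shape follows the latent layout discussed earlier.

```python
import numpy as np

def initial_noise(seed, shape=(4, 64, 64)):
    # The seed fully determines the starting latent noise (illustrative only).
    rng = np.random.default_rng(seed)
    return rng.normal(size=shape)

a = initial_noise(42)
b = initial_noise(42)    # same seed
c = initial_noise(123)   # different seed
print(np.array_equal(a, b))  # True: same seed, identical noise
print(np.array_equal(a, c))  # False: different seed, different noise
```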