The CFG Scale in Stable Diffusion
In the previous post we went through the basics of Stable Diffusion, so we can now build on them to explain further concepts. Today we continue with one of its pillars: classifier-free guidance (CFG), which controls how closely the prompt is followed during the sampling process in Stable Diffusion.
What is CFG?
CFG is short for classifier-free guidance. To understand this concept, we first need to know what its predecessor, classifier guidance, is.
Classifier Guidance
During the diffusion process, we can use a label to guide the process. In a sense, classifier guidance is a way of putting image labels in diffusion models.
The classifier guidance scale is then a parameter that controls how closely the diffusion process should follow the given label.
Suppose there are three sets of pictures, labeled 'cat', 'dog', and 'human' respectively. If the diffusion process is unguided, the model draws samples from all three sets without preferring any of them. Sometimes this produces an output image that fits two labels at once, such as a human petting a dog.
With high classifier guidance, the images generated by the diffusion model lean toward examples that are extreme or unambiguous for the given label. If you ask the model to draw a cat, it will return an image that is undoubtedly a cat and contains nothing else.
The classifier guidance scale controls how closely the diffusion process follows the guidance target. In the image above, the sample on the right was generated with a higher classifier guidance strength than the sample in the middle. In the diffusion model calculation, this scale value is just a multiplier on the drift term that pushes the sample toward the data with that label.
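To make the "multiplier on the drift term" idea concrete, here is a minimal toy sketch. It is not Stable Diffusion: it assumes a made-up 1-D dataset with two equally likely classes, "left" (a unit Gaussian centered at -2) and "right" (a unit Gaussian centered at +2), and all function names are hypothetical. The guided drift is the unguided drift plus the scale times the gradient of the classifier's log-probability for the target label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unconditional_score(x):
    """Gradient of log p(x) for the toy mixture 0.5*N(-2, 1) + 0.5*N(+2, 1)."""
    p_right = sigmoid(4.0 * x)  # posterior probability that x belongs to "right"
    return p_right * (2.0 - x) + (1.0 - p_right) * (-2.0 - x)

def classifier_grad(x):
    """Gradient of log p(label="right" | x); for this toy data it is 4 * (1 - sigmoid(4x))."""
    return 4.0 * (1.0 - sigmoid(4.0 * x))

def guided_score(x, scale):
    """Unguided drift plus `scale` times the classifier's pull toward the label."""
    return unconditional_score(x) + scale * classifier_grad(x)

x = np.linspace(-3.0, 3.0, 7)
for scale in (0.0, 1.0, 3.0):
    print(scale, np.round(guided_score(x, scale), 3))
```

With a scale of 0 the classifier has no influence, with a scale of 1 the drift is exactly that of the "right" class alone, and larger scales push samples even harder toward unambiguous "right" examples.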
Classifier-Free Guidance
Although classifier guidance brought a record-breaking performance improvement to diffusion models, it still needs an extra model (the classifier) to provide the guidance, which makes training more difficult.
Classifier-free guidance was proposed to address this. Instead of training on images and labels, it trains on images and their text descriptions.
Hence there is no need to train a separate classifier to judge how well an image matches a label.
The match between image and text is learned directly by the generation model itself, and the result is a single image generation model with a text-to-image function.
Classifier-Free Guidance Scale
Classifier-free guidance works by having the model make two noise predictions at each sampling step: one conditioned on the prompt and one unconditioned (with an empty prompt). The difference between the two points in the direction in which the prompt pulls the image. The CFG scale is the multiplier applied to that difference before it is added back to the unconditioned prediction.
The CFG scale therefore controls how strongly the AI-generated image should follow the prompt. Consider the prompts:
a cat, a dog, a human.
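- If guidance is switched off entirely, the prompt is ignored and the chance of generating a cat, a dog, or a human is equal. This corresponds to a guidance weight of -1 in the parameterization of the original classifier-free guidance paper, or a CFG scale of 0 in the convention used by most Stable Diffusion interfaces.
- If the CFG scale is set in a moderate range, between 7 and 10, a prompt for a cat will reliably produce a cat.
- If the CFG scale is set high, the images of the cat become unambiguous, although pushing the scale higher does not always help.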
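The effect of the scale can be sketched with a few lines of NumPy. This is a minimal illustration under assumptions, not the code of any particular tool: the two arrays stand in for the model's noise predictions without and with the prompt (here just random numbers), `cfg_combine` is a hypothetical helper name, and the scale follows the convention in which it multiplies the difference between the two predictions (so it equals the paper's guidance weight plus 1).

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Start from the unprompted prediction and move `scale` times
    along the direction that the prompt adds."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Stand-ins for the two noise predictions of a real model (assumption: random data).
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=4)  # prediction with an empty prompt
eps_cond = rng.normal(size=4)    # prediction with the actual prompt

for scale in (0.0, 1.0, 7.5, 20.0):
    guided = cfg_combine(eps_uncond, eps_cond, scale)
    # How far the guided prediction has moved away from the unprompted one.
    print(scale, np.linalg.norm(guided - eps_uncond))
```

In this convention a scale of 0 returns the unprompted prediction, a scale of 1 returns the prompted prediction unchanged, and anything above 1 extrapolates past it, which is why very high values can overshoot.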
Which CFG Scale is the best?
Now you may wonder which CFG scale is best for generating the image you have in mind.
Reasonable values generally lie between 7 and 10 but, as noted above, this is not always the case. You may need several tries to find the best value for a given prompt.
Consider the prompt below:
black cat, yellow eyes, cowboy hat, mechanical ears, desert, night
- If the CFG scale is set to 1, the picture has no clear shape and is too random. At about 3, the main keywords start to appear in the image.
- A CFG scale between 7 and 10 is relatively safe. You can increase or decrease it according to the situation.
- A CFG scale above 16, or even 20, is generally not recommended unless the prompt is very detailed, because it can hurt the coherence and quality of the generated image.
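Since the best value has to be found by trying, it helps to compare several scales with the same prompt and the same seed, so that the scale is the only thing that changes. The sketch below shows one way to organize that. `sweep_cfg_scales` and `fake_generate` are hypothetical names, and `fake_generate` only returns a description string so the example runs without a model. With Hugging Face diffusers, the `generate` callable could instead wrap a call such as `pipe(prompt, guidance_scale=scale, generator=torch.Generator().manual_seed(seed)).images[0]`, which is an assumption about your setup and is not exercised here.

```python
from typing import Any, Callable, Dict, Sequence

def sweep_cfg_scales(
    generate: Callable[[str, float, int], Any],
    prompt: str,
    scales: Sequence[float] = (1.0, 3.0, 7.0, 10.0, 16.0),
    seed: int = 0,
) -> Dict[float, Any]:
    """Run `generate` once per CFG scale, keeping prompt and seed fixed."""
    return {scale: generate(prompt, scale, seed) for scale in scales}

def fake_generate(prompt: str, scale: float, seed: int) -> str:
    # Placeholder for a real image generator (assumption: no model is loaded here).
    return f"{prompt} | cfg={scale} | seed={seed}"

prompt = "black cat, yellow eyes, cowboy hat, mechanical ears, desert, night"
results = sweep_cfg_scales(fake_generate, prompt)
for scale, image in results.items():
    print(scale, image)
```

Laying the results side by side makes it easier to see where the image starts to follow the prompt and where quality begins to degrade.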
In general, the CFG scale is a trade-off between accuracy and diversity. A low CFG value brings more diverse images. These can be creative but may be too twisted to follow the prompt, although for short prompts the results can be interesting and useful. A high CFG value gives you output that follows the prompt more accurately, but "the higher the better" does not apply in all cases.