The Basics of Stable Diffusion XL
On July 26, Stability AI open-sourced SDXL 1.0, the long-awaited text-to-image model billed as the most powerful open model of its kind.
SDXL 1.0 has the following new features:
- Better image quality. SDXL can produce high-quality images in almost any artistic style, and SDXL 1.0 is a huge improvement over SD v1.5 and SD v2.1 (and even over SDXL 0.9). Blind testers rated its images best in overall quality and aesthetics across styles, concepts and categories.
- More art styles. SDXL 1.0 can render more styles than its predecessors and "knows" more about each style. You can experiment with more artist names and aesthetics than before. SDXL 1.0 is particularly good at vivid, accurate color, with better contrast, lighting and shadows than its predecessors, and its quality is comparable to the latest version of Midjourney.
- Smarter, simpler language. SDXL needs only a few words to create complex, detailed, and beautiful images. Users no longer need to invoke qualifiers such as "masterpiece" to obtain high-quality images. SDXL can also distinguish between concepts such as "Red Square" (a famous place) and a "red square" (a shape).
- Higher resolution. With a base resolution of 1024 x 1024, SDXL 1.0 produces better image detail than its predecessors and handles different aspect ratios better.
- Largest open image model. SDXL 1.0 has one of the largest parameter counts of any open-source text-to-image model. It is built on a new architecture consisting of a 3.5B-parameter base model and a 6.6B-parameter ensemble pipeline that adds a refiner. The full model is an ensemble-of-experts pipeline for latent diffusion.
- Fine-tuning and advanced control. With SDXL 1.0, fine-tuning models on custom data is easier than ever. Custom LoRAs or checkpoints can be generated with less data wrangling. The Stability AI team is building T2I / ControlNet specifically for SDXL to provide the next generation of task-specific structure, style and composition control.
Stability AI compared SDXL 1.0 with earlier generations of Stable Diffusion models in a preference test on Discord, and the results show that people prefer images generated by SDXL 1.0.
SDXL 0.9, released a month earlier, was preferred 24.4% of the time, compared with 26.2% for the new SDXL 1.0.
How to use SDXL 1.0
Using SDXL 1.0 online
Currently, Stability AI provides several methods for using SDXL 1.0 online:
- Try SDXL 1.0 on Clipdrop.
- Use SDXL 1.0 via API on the Stability AI Platform.
- Get SDXL 1.0 on Amazon SageMaker and Amazon Bedrock.
- Join the Stable Foundation Discord for real-time testing of SDXL models.
- Generate images with SDXL 1.0 on DreamStudio.
We recommend the first option, Clipdrop, and the fourth, using SDXL 1.0 in the Discord community. These two are the simplest and most trouble-free.
Tips for Using SDXL 1.0
Using SDXL 1.0 in WebUI is not much different from using models based on SD 1.5. You still use prompts and negative prompts for text-to-image generation, and img2img for image-to-image generation. Because the SDXL base model is built for 1024x1024 resolution, three settings need attention:
- Resolution: 1024 Width x 1024 Height (or larger)
- Number of sampling steps: 30
- Sampling method: DPM++ 2M Karras (or other DPM++ sampler)
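The three settings above can be captured in a small checker. This is only a sketch: `GenerationSettings` and `check_sdxl_settings` are hypothetical names for illustration and are not part of WebUI or any library; the thresholds are simply the values recommended in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenerationSettings:
    # Defaults mirror the values recommended in this article.
    width: int = 1024
    height: int = 1024
    steps: int = 30
    sampler: str = "DPM++ 2M Karras"


def check_sdxl_settings(s: GenerationSettings) -> List[str]:
    """Return warnings for settings that deviate from the tips; empty means OK."""
    warnings = []
    if min(s.width, s.height) < 1024:
        warnings.append("resolution below 1024 on one side")
    if s.steps < 30:
        warnings.append("fewer than 30 sampling steps")
    if not s.sampler.startswith("DPM++"):
        warnings.append("sampler is not from the DPM++ family")
    return warnings
```

Settings carried over from an SD 1.5 workflow, such as 512x512 at 20 steps, would trigger the resolution and step warnings.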
Resolution
SDXL is trained on 1024 x 1024 images. It can generate images at 512x512, but the results are of lower quality and suffer from distortion. For best results, keep both height and width at 1024 or above.
SDXL 1.0 also supports resolutions beyond 1024. Here are some common resolutions you can try that keep the shorter side at 1024:
- 16:9 (cinematography) 1820 width x 1024 height
- 3:2 (professional photography) 1536 width x 1024 height
- 4:3 (normal image) 1365 width x 1024 height
Note that the more "stretched" your aspect ratio is, the greater the chance of distortion and mutation.
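The resolutions listed above follow from fixing the shorter side at 1024 and scaling the longer side by the aspect ratio. A minimal sketch of that arithmetic is below; `sdxl_resolution` is a hypothetical helper, and rounding to a multiple of 8 is an assumption added here because latent diffusion tools commonly require dimensions divisible by 8. With `multiple=1` it reproduces the figures in the list.

```python
def sdxl_resolution(ratio_w, ratio_h, short_side=1024, multiple=8):
    """Return (width, height) with the shorter side fixed at `short_side`.

    The longer side is scaled by the aspect ratio and rounded to the
    nearest multiple of `multiple` (assumption: 8, see the note above).
    """
    scale = max(ratio_w, ratio_h) / min(ratio_w, ratio_h)
    long_side = round(short_side * scale / multiple) * multiple
    if ratio_w >= ratio_h:
        return long_side, short_side  # landscape or square
    return short_side, long_side      # portrait


for name, (w, h) in {"16:9": (16, 9), "3:2": (3, 2), "4:3": (4, 3)}.items():
    print(name, sdxl_resolution(w, h), sdxl_resolution(w, h, multiple=1))
```

Swapping the two ratio arguments, for example `sdxl_resolution(9, 16)`, gives the portrait version of the same resolution.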
Number of sampling steps
With models fine-tuned from SD 1.5, 20 sampling steps were generally enough to obtain good pictures. With SDXL, however, an image generated at 20 steps tends to lack fine detail and looks like an unfinished painting.
You can therefore increase the number of sampling steps to obtain better picture quality. More steps take longer to generate, so how far you raise the value depends on the computing power of your local graphics card.
[Figure: comparison of results at 30 to 60 sampling steps]
Sampling method
The sampling method matters more for SDXL 1.0 than for previous versions. A DPM++ sampler is recommended, especially one that uses the Karras schedule, such as DPM++ 2M Karras or DPM++ 2S a Karras.
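The "Karras" in these sampler names refers to the noise schedule proposed by Karras et al. (2022), which spaces the noise levels so that more steps are spent at low noise. The sketch below computes that schedule in plain Python. The `sigma_min` and `sigma_max` defaults are illustrative assumptions, not values read from WebUI or from the SDXL checkpoint, and `karras_sigmas` is a name chosen for this example.

```python
def karras_sigmas(n, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    """Return n noise levels, from sigma_max down to sigma_min (Karras schedule).

    sigma_i = (max^(1/rho) + i/(n-1) * (min^(1/rho) - max^(1/rho)))^rho
    The default sigma range is an illustrative assumption.
    """
    if n < 2:
        raise ValueError("need at least 2 steps")
    min_inv = sigma_min ** (1.0 / rho)
    max_inv = sigma_max ** (1.0 / rho)
    return [
        (max_inv + i / (n - 1) * (min_inv - max_inv)) ** rho
        for i in range(n)
    ]


sigmas = karras_sigmas(30)
print([round(s, 3) for s in sigmas[:3]], "...", [round(s, 3) for s in sigmas[-3:]])
```

Compared with evenly spaced noise levels, the values cluster near the low end, which is where fine detail is resolved.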
Using the SDXL Refiner to generate images
The Refiner model refines an image by adding detail and making it sharper. It works best for photorealistic generation. In WebUI you can use it through the img2img function with a low denoising strength.
The following are 4 pictures generated with different denoising strengths. The comparison shows that although the image generated at a denoising strength of 0.75 is sharper than the original, it is no longer the same picture at all. Remember that higher is not better: too high a denoising strength often produces implausible results and makes the redrawn image unusable.
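One way to see why a high denoising strength drifts away from the original: in common img2img implementations (for example the img2img pipelines in Hugging Face diffusers), the strength decides how much noise is added to the source image and how many of the scheduled steps are actually run. The sketch below assumes that behavior; `img2img_steps` is a hypothetical helper, and WebUI may count steps differently depending on its settings.

```python
def img2img_steps(num_steps, strength):
    """Number of denoising steps actually run for a given strength.

    Assumption: steps scale with strength, as in diffusers-style img2img.
    strength=0 leaves the image untouched; strength=1 regenerates it fully.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_steps * strength), num_steps)


for strength in (0.25, 0.5, 0.75, 1.0):
    print(strength, img2img_steps(30, strength))
```

At a low strength only the last few steps run, so the Refiner can sharpen detail without changing the composition.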
Aesthetic Scores
Aesthetic score is a new setting that comes with SDXL models. It is used only by the refiner model. Each image in the SDXL training data has an aesthetic score, with 0 representing the ugliest and 10 the best-looking. By setting the SDXL high aesthetic score to a high value, you bias the prompt towards images with that score, which in theory improves the aesthetics of your image.
Suppose the training data contains many poor-quality images of the same subject, for example a red apple on a rough wooden table. The model may start to associate red apples and rough wooden tables with "bad quality" art simply because there are so many low-quality examples. A high-scoring painting that happens to feature the same subject may then not be enough to offset them, and the model produces only a poorer-quality red apple.
This is where the aesthetic score comes into play. Because it is part of the conditioning alongside the text embedding, it is effectively treated as part of the caption. The model "takes into account" the aesthetic rating of each image during training and uses the given rating during inference to place less emphasis on features of low-scoring images.
This way, the output won't be too biased towards a few (or even many) "bad" examples. As a result, the quality of the dataset used for subsequent fine-tuning does not affect the AI's ability to create beautiful work as much as you might expect.
SDXL's low aesthetic score setting is a bit confusing. It sets the bias for the negative prompt, which is normally expected to describe images with low aesthetic ratings. It therefore works in the opposite direction to the high aesthetic score: the lower this value, the better your image will look, and vice versa.
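The sketch below shows how the two scores can be attached to the positive and negative conditioning. It is modeled on how the Hugging Face diffusers SDXL refiner (img2img) pipeline assembles its extra conditioning values and is an illustration, not WebUI's implementation. `refiner_conditioning` is a hypothetical name, and the defaults of 6.0 and 2.5 are the defaults diffusers uses for `aesthetic_score` and `negative_aesthetic_score`.

```python
def refiner_conditioning(original_size, crop_top_left=(0, 0),
                         aesthetic_score=6.0, negative_aesthetic_score=2.5):
    """Build the extra conditioning values for the positive and negative prompt.

    Each list is: original height/width, crop offset, then the aesthetic score.
    The high score goes with the prompt, the low score with the negative prompt.
    """
    base = tuple(original_size) + tuple(crop_top_left)
    positive = list(base + (aesthetic_score,))
    negative = list(base + (negative_aesthetic_score,))
    return positive, negative


positive, negative = refiner_conditioning((1024, 1024))
print(positive)
print(negative)
```

Raising the first score or lowering the second widens the gap between what the model steers towards and what it steers away from.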
Custom models with LoRA
The SD 1.5 base model is trained on 512x512 images, whereas the SDXL 1.0 base model is trained on 1024x1024 images. As a result, models and LoRAs fine-tuned from SD 1.5 are incompatible with SDXL 1.0 and cannot be used with it.
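Before loading a LoRA, it can help to check which base model it was trained for. The sketch below reads the metadata block from a `.safetensors` file header: an 8-byte little-endian length followed by a JSON header, with optional metadata under `__metadata__`. The key `ss_base_model_version` is an assumption: some training tools record the base model there, but many files carry no such metadata, so a negative result proves nothing. All function names are hypothetical.

```python
import json
import struct


def parse_safetensors_metadata(raw):
    """Return the free-form metadata dict from safetensors bytes, or {}."""
    # First 8 bytes: little-endian unsigned length of the JSON header.
    (header_len,) = struct.unpack("<Q", raw[:8])
    header = json.loads(raw[8:8 + header_len].decode("utf-8"))
    return header.get("__metadata__", {})


def read_lora_metadata(path):
    """Read only the header of a file on disk, not the tensor data."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        (header_len,) = struct.unpack("<Q", prefix)
        return parse_safetensors_metadata(prefix + f.read(header_len))


def looks_like_sdxl_lora(metadata, key="ss_base_model_version"):
    # Assumption: the training tool wrote the base model name under `key`.
    return "sdxl" in str(metadata.get(key, "")).lower()
```

For a file on disk, `looks_like_sdxl_lora(read_lora_metadata(path))` gives a quick hint; if the metadata is missing, fall back to the information on the model's download page.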