Mar 1, 2024

New Model Stable Cascade Part 1

In February 2024, Stability AI surprised us once again by introducing Stable Cascade, a new-generation text-to-image model. It is based on the novel Würstchen architecture and adopts a three-stage process. Compared to Stable Diffusion XL, Stable Cascade generates images faster and with higher quality, giving users a smoother and more precise experience.

In this article, we delve into the underlying mechanism and practical application of Stable Cascade, guiding you from the theoretical foundations to hands-on operation in one place.

Table of Contents

Intro. of Stable Cascade
How it works
1. Model Architecture
2. Workflow

Intro. of Stable Cascade

Stable Cascade is a text-to-image generation model based on the novel Würstchen architecture, which uses a three-stage process. Compared to the previous Stable Diffusion XL, Stable Cascade offers both faster generation and improved output. The model and code have been open-sourced, including both the training and inference code.

Advantages of the architecture

One key difference from other models is its highly compressed latent space design, which greatly reduces the size of the latent space. A smaller latent space means faster inference and lower training costs.

Stable Diffusion uses a compression factor of 8, which shrinks a 1024x1024 image to a 128x128 latent. In contrast, Stable Cascade achieves a compression factor of 42, reducing a 1024x1024 image to a latent of only 24x24 while maintaining the image's clarity.
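The arithmetic behind these figures can be checked with a short sketch. The helper names `latent_side` and `latent_positions` are illustrative assumptions, not part of any Stable Cascade code; they simply divide the image side by the compression factor and round the result.

```python
def latent_side(image_side: int, compression: float) -> int:
    """Side length of the latent grid for a square image (rounded)."""
    return round(image_side / compression)


def latent_positions(image_side: int, compression: float) -> int:
    """Number of spatial positions in the latent grid."""
    side = latent_side(image_side, compression)
    return side * side


sd_side = latent_side(1024, 8)        # 128
cascade_side = latent_side(1024, 42)  # 24 (1024 / 42 is about 24.4)

# How many times fewer spatial positions the 24x24 grid has.
ratio = latent_positions(1024, 8) / latent_positions(1024, 42)
print(sd_side, cascade_side, round(ratio, 1))  # 128 24 28.4
```

In other words, a 24x24 grid has roughly 28 times fewer spatial positions than a 128x128 grid, which is where most of the savings in the text-conditional stage come from.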

Performance

In performance evaluations, Stable Cascade produced strong results across several measures, including image generation quality, inference speed, and efficiency.

It remains stable and reliable while operating in such a highly compressed latent space. Moreover, a text-conditional model trained in this latent space achieves a 16x cost reduction compared to earlier models such as Stable Diffusion 1.5.

Stable Cascade is not only more efficient but also scores well on aesthetic quality. In these evaluations, Stable Cascade (30 inference steps) outperformed Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step), and Würstchen v2 (30 inference steps).

Stable Cascade's focus on efficiency is evident in its architecture and its more highly compressed latent space. Although the largest model has 1.4 billion more parameters than Stable Diffusion XL, its inference time is still faster, as shown below:

Flexibility and scalability

The model supports a wide range of extensions such as fine-tuning, ControlNet, and LoRA, allowing users to customize and adapt it to their application scenarios and requirements.

Open Source

Both the Stable Cascade model and its code are open source, so users can use, customize, and extend the model freely. It is a powerful and reliable image-generation solution available to all users.

How it works

Stable Cascade's architecture is based on the Würstchen architecture and forms a three-stage text-to-image model. Below is an in-depth explanation of its working principle and of each stage:

Model Architecture

Stable Cascade is composed of three models, Stage A, Stage B, and Stage C, which together form a cascade for image generation; this is the origin of the name “Stable Cascade.” Stages A and B compress image data, much as the VAE does in Stable Diffusion. However, as discussed earlier, this configuration allows a far higher level of compression. Stage C is responsible for generating the 24x24 latent that corresponds to the given text prompt.

The following diagram illustrates the process. Note that Stage A is a VAE, while Stages B and C are both diffusion models.

This release provides two checkpoints for Stage C, two for Stage B, and one for Stage A. The 3.6 billion parameter version of Stage C is strongly recommended, since most of the fine-tuning work went into it.

Both Stage B variants achieve good results, but the 1.5 billion parameter version is better at reconstructing small, fine details. Hence, choosing the larger variant of each stage gives the best results. Lastly, Stage A contains 20 million parameters and is fixed.
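To get a feel for what the largest configuration means in practice, here is a rough back-of-the-envelope sketch. It assumes checkpoint sizes of 3.6 billion (Stage C), 1.5 billion (Stage B), and 20 million (Stage A) parameters, and it assumes 16-bit weights; both the `weights_gb` helper and the 2-bytes-per-parameter figure are illustrative assumptions. Only weight storage is counted, not activations or the text encoder.

```python
# Parameter counts for the largest variant of each stage
# (ASSUMPTION: 3.6B / 1.5B / 20M, as stated in the lead-in).
PARAMS = {
    "Stage C": 3_600_000_000,
    "Stage B": 1_500_000_000,
    "Stage A": 20_000_000,
}

BYTES_PER_PARAM = 2  # ASSUMPTION: 16-bit weights (fp16 or bf16)


def weights_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_params = sum(PARAMS.values())
for stage, count in PARAMS.items():
    print(f"{stage}: {count / 1e9:.2f}B params, ~{weights_gb(count):.2f} GB")
print(f"Total: {total_params / 1e9:.2f}B params, ~{weights_gb(total_params):.2f} GB")
```

Under these assumptions the three stages together hold about 5.12 billion parameters, or roughly 10 GB of weights at 16-bit precision.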

The role of each of the three stages, in more detail:

Stage A: Stage A is a VAE derived from decomposition autoencoder (DAE). It serves as a compressor for images, similar to the VAE in Stable Diffusion, but combined with Stage B the overall compression is far greater, resulting in a smaller latent space. Its key function is to encode high-resolution images into a lower-resolution representation for the next stages to process.
Stage B: Stage B also belongs to the compression part of the model, but unlike Stage A it is a diffusion model. It compresses further than Stage A alone, making the latent space smaller and hence improving the model’s efficiency. Together, the two stages give Stable Cascade a compression factor of 42, meaning a 1024x1024 image is encoded into just a 24x24 latent while retaining the image’s fidelity.
Stage C: Stage C is the core of Stable Cascade and is responsible for generating content from text prompts. Working within the compressed latent space, it generates a latent according to the input text prompt through a series of diffusion steps, while preserving image clarity and quality. Like Stage B, and unlike Stage A, Stage C is a diffusion model.

Workflow

The workflow of Stable Cascade can be summarized in a few steps:

Image compression: During training, images are processed by Stages A and B and compressed into a very small latent space. This lets the model work with images far more efficiently.
Text prompt processing: The user provides a text prompt, which is preprocessed and encoded into features the model can use.
Image generation: Within the latent space, Stage C generates a latent based on the text prompt. This involves multiple diffusion steps, each adding more detail. Stages B and A then decode the result into the final clear, high-quality image.
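At generation time, the data flow can be pictured as a chain: the prompt is encoded, Stage C denoises a 24x24 latent, and Stages B and A expand it into the final image. The sketch below tracks only shapes and is not the real implementation. Every name in it (`Grid`, `encode_prompt`, `stage_c`, `stage_b`, `stage_a`, `generate`) is hypothetical, and the Stage A compression factor of 4 is an assumption that this article does not state.

```python
from dataclasses import dataclass


@dataclass
class Grid:
    """Shape-only stand-in for a tensor; no pixel data is modelled."""
    side: int
    kind: str


def encode_prompt(prompt: str) -> list[str]:
    # Hypothetical stand-in for the text encoder: lowercase word tokens.
    return prompt.lower().split()


def stage_c(text_features: list[str], steps: int = 30) -> Grid:
    # Stage C: text-conditional diffusion in the 24x24 latent space.
    latent = Grid(side=24, kind="noise")
    for _ in range(steps):
        # A real model would remove a little noise per step, guided by
        # text_features; this sketch only records that denoising ran.
        latent.kind = "stage_c_latent"
    return latent


def stage_b(latent: Grid, stage_a_factor: int = 4, image_side: int = 1024) -> Grid:
    # Stage B: diffusion model that expands the small latent into the
    # larger latent that Stage A can decode.
    # ASSUMPTION: Stage A compresses by a factor of 4 (not stated above).
    if latent.kind != "stage_c_latent":
        raise ValueError("stage_b expects the output of stage_c")
    return Grid(side=image_side // stage_a_factor, kind="stage_a_latent")


def stage_a(latent: Grid, stage_a_factor: int = 4) -> Grid:
    # Stage A: VAE decoder that maps its latent back to pixels.
    return Grid(side=latent.side * stage_a_factor, kind="image")


def generate(prompt: str) -> Grid:
    features = encode_prompt(prompt)
    return stage_a(stage_b(stage_c(features)))


image = generate("a red fox in the snow")
print(image)  # Grid(side=1024, kind='image')
```

The point of the sketch is the ordering: the expensive text-conditional work happens on the tiny 24x24 grid, and the later stages only have to turn that latent back into pixels.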

Through this workflow, Stable Cascade effectively generates high-quality images according to text prompts provided by users, while maintaining high efficiency and inference speed. Its unique architecture and workflow allow it to achieve impressive feats in the field of image generation.