New Model Stable Cascade Part 3

The Stable Cascade repository provides code for training Stable Cascade from scratch, as well as for fine-tuning, ControlNet, and LoRA training. The guide on how to use it can be found in the training files folder. Note: the code repository is at an early stage of development, so you may encounter unexpected errors or unoptimized training and inference code.

What follows is a guide to using Stable Cascade's training code, along with instructions for downloading the required models. Specifically, it includes training scripts for various use cases, such as text-to-image, ControlNet, LoRA, and image reconstruction. Note: Stable Cascade uses Stages A and B for image compression and Stage C for text-conditional learning; hence, LoRA or ControlNet only need to be trained for Stage C.

Table of Contents

  1. Basic Settings
  2. Start Training
  3. Dataset Transformation
  4. Text-to-Image Fine-Tune
  5. ControlNet Training
  6. LoRA Training
  7. Image Reconstruction Training
  8. Summary

Basic Settings

Stable Cascade provides a training configuration folder that houses all configuration files for training. These files follow a similar structure and include the essential required settings. For instance, you should set the run name, the checkpoint and output directories, and the model version:

experiment_id: stage_c_3b_controlnet_base
checkpoint_path: /path/to/checkpoint
output_path: /path/to/output
model_version: 3.6B

If you want to use Weights & Biases for logging, you can also set up related information:

wandb_project: StableCascade
wandb_entity: wandb_username 

Next, the training parameters are defined, such as learning rate, batch size, image size, and so on:

lr: 1.0e-4
batch_size: 256
image_size: 768
multi_aspect_ratio: [1/1, 1/2, 1/3, 2/3, 3/4, 1/5, 2/5, 3/5, 4/5, 1/6, 5/6, 9/16]
grad_accum_steps: 1
updates: 500000
backup_every: 50000
save_every: 2000
warmup_updates: 1
use_fsdp: False

  • FSDP: Stable Cascade offers PyTorch's FSDP (Fully Sharded Data Parallel) option. It shards model parameters, gradients, and optimizer states across multiple GPUs, making it possible to train larger models.
  • Multi-aspect-ratio: You can set the desired aspect ratio in the multi_aspect_ratio list.
  • EMA: An EMA (exponential moving average) copy of the model can significantly improve the performance of diffusion models. It is enabled by setting the following parameters (or omitted if not needed):
ema_start_iters: 5000 
ema_iters: 100 
ema_beta: 0.9
  • Dataset configuration: Stable Cascade uses webdataset to handle datasets. You may specify multiple dataset paths, whether they are located on Amazon S3 cloud storage or on local drives. Each training type and dataset often requires specific content as well; details are discussed later in this guide.
webdataset_path:
  - s3://path/to/your/first/dataset/on/s3
  - file:/path/to/your/local/dataset.tar
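As an illustration of the EMA settings above (ema_start_iters, ema_iters, ema_beta), here is a minimal sketch of the standard EMA update schedule they imply. This is hypothetical code for intuition only, not the repository's implementation; the function names and the dict-of-floats weight representation are invented for the example.

```python
# Minimal sketch of an EMA schedule driven by the config values above.
# The update rule is the standard exponential moving average used in
# diffusion training: ema = beta * ema + (1 - beta) * model.

def ema_update(ema_weights, model_weights, beta=0.9):
    """Blend the current model weights into the EMA copy."""
    return {k: beta * ema_weights[k] + (1.0 - beta) * model_weights[k]
            for k in ema_weights}

def maybe_update_ema(step, ema_weights, model_weights,
                     ema_start_iters=5000, ema_iters=100, ema_beta=0.9):
    """Apply the EMA update only after warm-up and only every `ema_iters` steps."""
    if step < ema_start_iters or step % ema_iters != 0:
        return ema_weights
    return ema_update(ema_weights, model_weights, ema_beta)
```

With the defaults above, the EMA copy stays frozen until step 5000 and is then refreshed every 100 updates.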

Start Training

To start actual training, first navigate to the root directory of the repository, then run the corresponding training command, such as:

# python3 training_file training_config
# For example, the command to train a LoRA model might look like this:
python3 train/train_c_lora.py configs/training/finetune_c_3b_lora.yaml

Additionally, a bash script is provided for Slurm, a cluster management system. Note that this assumes you have access to a cluster managed by Slurm.

Dataset Transformation

As previously mentioned, Stable Cascade's code uses webdataset, which makes it easy to handle large amounts of data. If you wish to fine-tune or train LoRA or ControlNet, your dataset may not yet be in webdataset format, so a basic example is provided showing how to convert a dataset into the appropriate format.

  1. Place all images and captions in a single folder.
  2. Rename them so that each pair shares the same numeric ID. Example: 0000.jpg and 0000.txt, 0001.jpg and 0001.txt, and so on.
  3. Run this command: tar --sort=name -cf dataset.tar dataset/ or create the tar file manually from the folder.
  4. In the config file, specify webdataset_path: file:/path/to/your/local/dataset.tar

If your dataset contains additional per-image information, such as values you wish to filter on, certain configuration settings come in handy. You can apply dataset filters in the config file as shown below:

dataset_filters:
  - ['aesthetic_score', 'lambda s: s > 4.5']
  - ['nsfw_probability', 'lambda s: s < 0.01']

This applies when your dataset stores additional information as JSON, with keys such as aesthetic_score and nsfw_probability. In that case, the dataset also includes 0000.json, 0001.json, 0002.json, and so on, alongside the image and caption files.
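To make the filtering mechanism concrete, here is a hedged sketch of how such filter entries could be applied to a sample's JSON metadata. This is not the repository's actual loader logic; the function name and the evaluation of the lambda strings with eval are illustrative assumptions.

```python
import json

# Filters in the same (key, lambda-source) form as the config entries above.
DATASET_FILTERS = [
    ("aesthetic_score", "lambda s: s > 4.5"),
    ("nsfw_probability", "lambda s: s < 0.01"),
]

def keep_sample(meta_json, filters=DATASET_FILTERS):
    """Return True only if the sample's JSON metadata passes every filter."""
    meta = json.loads(meta_json)
    for key, source in filters:
        predicate = eval(source)  # the config stores each lambda as a string
        if key not in meta or not predicate(meta[key]):
            return False
    return True
```

A sample missing a filtered key, or failing any predicate, is simply skipped during training.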

Text-to-Image Fine-Tune

To fine-tune any model, first acquire its pre-trained weights. Instructions for downloading models are detailed in the guide's Download Models section and in the official documentation. Once that is done, update the checkpoint path in the config file.

For Stage C fine-tuning on your own dataset, all the crucial parameters have been explained in depth in the previous sections, so they will not be repeated here. Interested readers can refer to the config files for fine-tuning the 3.6B Stage C and the 1B Stage C models.

ControlNet Training

Training ControlNet requires configuring some additional parameters, along with the specific ControlNet filters needed. Such filters can, for example, perform edge detection (such as Canny) or human pose detection. An example config follows:

controlnet_blocks: [0, 4, 8, 12, 51, 55, 59, 63]
controlnet_filter: CannyFilter
controlnet_filter_params:
  resize: 224

Here we need to say a bit more about Stage C's architecture. It is essentially a sequence of residual blocks operating at the same latent resolution; it does not use a UNet. That is the purpose of controlnet_blocks: it identifies the positions at which you wish to inject control information. This differs from the architecture commonly used in Stable Diffusion, which creates a complete copy of the UNet encoder. The ControlNet approach here is simpler and has the major advantage of working with far fewer parameters.
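The role of controlnet_blocks can be illustrated with a toy model: a flat sequence of residual blocks at one latent resolution, where the control signal is injected only at the listed indices. This is a deliberately simplified sketch (scalar "features", addition as the injection mechanism), not Stage C's real forward pass.

```python
def run_backbone(x, blocks, control, controlnet_blocks):
    """Toy Stage-C-style forward pass.

    `blocks` is a flat list of residual blocks (callables), all working at
    the same latent resolution; the ControlNet signal `control` is injected
    (here: simply added) only at the indices listed in `controlnet_blocks`.
    """
    for i, block in enumerate(blocks):
        if i in controlnet_blocks:
            x = x + control   # inject control information at this position
        x = x + block(x)      # residual update
    return x
```

The point is that control enters at a handful of chosen positions in the sequence, rather than through a duplicated encoder.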

Next, through the controlnet_filter parameter, you define an image filter that extracts the information you wish to condition Stage C on (Canny edge detection, human pose detection, etc.). In the example, we use the CannyFilter defined in the repository; this is also where you can add your own ControlNet filters. Finally, controlnet_filter_params simply sets additional parameters for your controlnet_filter class. That's it. You can refer to the ControlNet configurations for the Inpainting / Outpainting, Face Identity, Canny, and Super Resolution examples.
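The interface implied by controlnet_filter and controlnet_filter_params can be sketched as a small class that receives its extra parameters (such as resize) in the constructor and maps an image to a conditioning signal. Real Canny edge detection needs a library such as OpenCV, so this hypothetical stand-in just thresholds gradient magnitudes on a grayscale image given as a list of rows.

```python
class SimpleEdgeFilter:
    """Hypothetical stand-in for a ControlNet filter such as CannyFilter.

    A real filter would run Canny edge detection and resize the result to
    `resize`; here we only threshold horizontal/vertical gradients on a
    grayscale list-of-rows image, to show the constructor/call interface.
    """

    def __init__(self, resize=224, threshold=0.5):
        self.resize = resize        # as passed via controlnet_filter_params
        self.threshold = threshold

    def __call__(self, image):
        h, w = len(image), len(image[0])
        edges = [[0] * w for _ in range(h)]
        for y in range(h - 1):
            for x in range(w - 1):
                gx = image[y][x + 1] - image[y][x]
                gy = image[y + 1][x] - image[y][x]
                if (gx * gx + gy * gy) ** 0.5 > self.threshold:
                    edges[y][x] = 1
        return edges
```

A custom filter only needs to follow this shape: configure it from the YAML, then call it on each image to produce the conditioning input.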

LoRA Training

LoRA training on Stage C requires some extra training parameters to be configured as follows:

module_filters: ['.attn']
rank: 4
train_tokens:
  - ['[fernando]', '^dog</w>']

The extra parameters include module_filters (defining the modules on which LoRA layers are trained), rank, and train_tokens (dedicated to learning specific tokens). In the example above, a new token named [fernando] is created and initialized with the average value of all existing tokens matching 'dog'. You may refer to the sample code for details.
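The train_tokens behavior described above can be sketched as follows: the new token's embedding is initialized as the mean of every existing embedding whose token string matches the given regular expression (the '</w>' in the config marks a word-final token in the tokenizer's vocabulary). The function name and the dict-of-lists embedding table are invented for this illustration.

```python
import re

def init_new_token(vocab_embeddings, new_token, pattern):
    """Add `new_token` to a {token: embedding-list} table, initializing it
    as the element-wise mean of all embeddings whose token matches `pattern`."""
    matches = [emb for tok, emb in vocab_embeddings.items()
               if re.search(pattern, tok)]
    dim = len(matches[0])
    mean = [sum(e[i] for e in matches) / len(matches) for i in range(dim)]
    vocab_embeddings[new_token] = mean
    return vocab_embeddings
```

During LoRA training, only this new embedding (and the LoRA layers) would then be optimized.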

Image Reconstruction Training

Image reconstruction training concentrates on Stage B, which shoulders the bulk of the compression load, while Stage A performs only minimal compression. Using Stage A allows Stage B to train and infer faster and more efficiently, since combined with Stage A, Stage B can work in a smaller latent space. To try to achieve a higher compression ratio, or to adapt it to specific content, you can refer to the training configs for the large Stage B or the small Stage B.


Summary

Exploring and practicing Stable Cascade's training process is not just a technical challenge, but an expedition to gain deeper insight into the capabilities of contemporary artificial intelligence. From fine-tuning pre-trained models to delving into specific ControlNet or LoRA training, every step showcases the monumental potential of deep learning in the fields of image and text processing.

Here's hoping this article provides valuable insights to readers of all levels of experience. Whether you are a veteran in the field or a newcomer, Stable Cascade offers a broad experimental platform on which to explore, learn, and ultimately advance the field.