What is ControlNet?

On the eve of Valentine's Day, a new plug-in for Stable Diffusion named ControlNet was released, sparking a flurry of excitement in the AI art community.

It quickly gained acclaim, with many netizens praising its ability to bring AI generation technology to a level of increased control and precision. Some even went so far as to say that ControlNet marks a new era for AI generation, where the technology is better integrated with design and production, making it truly useful.

Table of Contents

  1. What is ControlNet?
  2. The models of ControlNet
    1. Object type
      1. OpenPose
    2. Outline type
      1. Canny hard edge
      2. MLSD straight line
      3. Lineart line draft
      4. Soft edge
      5. Scribble
    3. Depth of field
    4. Redraw type
      1. Inpaint
      2. Tile
      3. InstructP2P
  3. Summary

What is ControlNet?

ControlNet is a lightweight pretrained model for Stable Diffusion, built on version 1.5. In the paper, developer Lvmin Zhang explains that ControlNet controls a large diffusion model through input conditions: it combines features extracted from an input image, such as edges, depth, or a skeleton, with the text prompt to accurately guide the image generation results of Stable Diffusion 1.5.

This game changer gives Stable Diffusion users more conditioning power beyond the text prompt. A common rule of thumb is that ControlNet controls composition and LoRA controls style. There is one caveat to that rule: the ControlNet framework is so powerful that it seems to be gradually becoming all-purpose. A very capable style transfer model, Shuffle, has already been released and is introduced in detail later.

Let's see what ControlNet can do.

The two images above show a user combining a jumping pose with an AI image generator. Without ControlNet, the image in the upper right corner has almost nothing to do with the original picture. With ControlNet learning the skeleton features of the human posture, the result is a girl in the same pose as in the original picture.

The models of ControlNet

At the early stage, ControlNet had only 8 source models: depth, hed, normal, scribble, seg, openpose, MLSD, and canny. After six months of iteration, this has grown to 14 models, each with its own characteristics. Remembering all of them is a bit difficult for novices, so here we divide the models into 4 types according to what they control: object type, outline type, depth of field type, and redraw type.

Object type

The object type mainly refers to the key model OpenPose, the only model used to control human limbs and facial expressions.


OpenPose

OpenPose detects the key points of the human body structure, such as the positions of the head, shoulders, arms, and knees, while ignoring details such as hair and background. By capturing where these body parts are, it restores the character's posture and expression.

OpenPose provides 5 built-in preprocessors for human body structures: openpose, face, faceonly, full, and hand. They detect different parts of the body, such as limbs, facial features, and hands.

openpose is the most basic preprocessor. It detects most key points of the human body and connects them with lines of different colors.

face builds on openpose with stronger face recognition. In addition to the basic facial orientation, it also recognizes facial features such as the eyes, nose, mouth, and facial contour, so face restores human expressions well.

faceonly processes only the facial contour points and is suitable for scenes that depict nothing but facial details.

hand adds a description of the hand structure on top of openpose, which goes a long way toward solving the common problem of deformed hands.

full combines all of the preprocessing functions above and is the most comprehensive preprocessor. Switching between the individual preprocessors one by one is tedious, so for daily use it is recommended to select full directly, since it includes all key point detection.

Because the preprocessors capture structural information about facial features and hands in addition to basic body posture, OpenPose is a very good control tool for human details such as expressions and hands.
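The coverage of the five preprocessors described above can be summarized as a small lookup table. The following is only an illustrative sketch: the group labels ("body", "face", "hands") and the helper `pick_preprocessor` are hypothetical names chosen for this example, not part of the ControlNet plug-in, and the preprocessor names are the short forms used in this article.

```python
# Illustrative mapping only: which keypoint groups each preprocessor covers,
# as described in this article. This is not the plug-in's own data structure.
PREPROCESSOR_FEATURES = {
    "openpose": {"body"},
    "face": {"body", "face"},
    "faceonly": {"face"},
    "hand": {"body", "hands"},
    "full": {"body", "face", "hands"},
}


def pick_preprocessor(needed):
    """Return the preprocessor with the smallest coverage that includes `needed`."""
    needed = set(needed)
    candidates = [
        (len(features), name)
        for name, features in PREPROCESSOR_FEATURES.items()
        if needed <= features
    ]
    if not candidates:
        raise ValueError(f"no preprocessor covers {sorted(needed)}")
    # Fewest covered groups wins; ties are broken alphabetically by name.
    return min(candidates)[1]


print(pick_preprocessor({"body", "hands"}))  # hand
print(pick_preprocessor({"face", "hands"}))  # full
```

As soon as both faces and hands matter, only full qualifies, which is consistent with the advice to use full for everyday work.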

Outline type

The outline type constrains the image content through the outlines of its elements. The outline type models include Canny (hard edge), MLSD (straight line), Lineart (realistic line art), Lineart_anime (anime line art), SoftEdge (soft edge), Segmentation (semantic segmentation), and Shuffle (random shuffling).

Each model is equipped with a corresponding preprocessor. Due to differences in algorithms and versions, the same model may provide multiple preprocessors for users to choose.

Canny hard edge

Canny hard edge has a wide range of uses, and the author describes it as one of the most commonly used ControlNet models. It is derived from edge detection algorithms in the field of image processing: it identifies and extracts the edge features in an image and transfers them to a new image.

Canny has only one preprocessor, canny. As we can see in the picture above, canny accurately extracts line drawings along the edges of the elements in the picture, and the content layout is restored accurately even when drawing with different base models.
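To give a feel for what "extracting edge features" means, here is a minimal pure-Python sketch that marks pixels where the brightness changes sharply. It is a simplification and not the canny preprocessor itself: it only thresholds the Sobel gradient magnitude, whereas the full Canny algorithm also applies Gaussian smoothing, non-maximum suppression, and hysteresis thresholding. The function name `sobel_edges` and the toy image are made up for this example.

```python
def sobel_edges(image, threshold):
    """Return a binary edge map (1 = edge) for a 2D list of grayscale values."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    # Border pixels are skipped because the 3x3 kernel does not fit there.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal brightness change (right column minus left column).
            gx = (
                image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1]
            )
            # Vertical brightness change (bottom row minus top row).
            gy = (
                image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1]
            )
            magnitude = (gx * gx + gy * gy) ** 0.5
            edges[y][x] = 1 if magnitude >= threshold else 0
    return edges


# Toy image: a dark region on the left, a bright region on the right.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edge_map = sobel_edges(image, threshold=100)
for row in edge_map:
    print(row)
```

Only the columns around the dark-to-bright boundary are marked, which is the kind of outline information the control model then uses to fix the layout of the new image.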

MLSD straight line

The second control type is MLSD straight line. MLSD extracts all straight-line edges in the picture. In the figure below, you can see that after MLSD straight line detection preprocessing, only the straight-line features in the picture will be retained, while the curve features will be ignored.

Therefore, MLSD is mostly used to extract linear geometric boundaries of objects, most typically in fields such as geometric architecture, interior design, and road and bridge design.

Lineart line draft

Lineart also extracts line drawings from the edges of an image, but its usage scenarios are divided more finely into two directions: Realistic and Anime.

In the ControlNet plug-in, both the lineart and lineart_anime control models are placed under the "Lineart" control type and are used for realistic and anime image drawing respectively. There are as many as 5 supporting preprocessors: those with "anime" in their name are used for feature extraction from anime-style images, while the others are used for realistic images.

Soft Edge

Soft Edge is a special edge line drawing extraction method. Its characteristic is that it can extract edge lines with a gradient effect. The resulting drawing will look softer and have a natural transition.

There are 4 different preprocessors provided in SoftEdge, namely HED, HEDSafe, PiDiNet and PiDiNetSafe.

In the official performance comparison, model stability is ranked PiDiNetSafe > HEDSafe > PiDiNet > HED, while result quality is ranked HED > PiDiNet > HEDSafe > PiDiNetSafe.
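The two rankings quoted above are exact mirror images of each other, so every step toward stability costs a step in quality. The sketch below makes that trade-off explicit. The weighting scheme and the function `rank_softedge` are assumptions invented for this illustration; they are not part of ControlNet or of the official comparison.

```python
# Rankings as quoted in this article, best first.
STABILITY_RANK = ["PiDiNetSafe", "HEDSafe", "PiDiNet", "HED"]
QUALITY_RANK = ["HED", "PiDiNet", "HEDSafe", "PiDiNetSafe"]


def rank_softedge(stability_weight):
    """Order the preprocessors by a weighted blend of the two rankings.

    stability_weight is a number in [0, 1]: 1.0 means only stability matters,
    0.0 means only result quality matters. Lower blended score is better.
    """
    if not 0.0 <= stability_weight <= 1.0:
        raise ValueError("stability_weight must be between 0 and 1")

    def score(name):
        return (
            stability_weight * STABILITY_RANK.index(name)
            + (1.0 - stability_weight) * QUALITY_RANK.index(name)
        )

    return sorted(STABILITY_RANK, key=score)


print(rank_softedge(1.0)[0])  # PiDiNetSafe
print(rank_softedge(0.0)[0])  # HED
```

In other words, there is no preprocessor that wins on both axes; the choice depends on which property matters more for the image at hand.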

[Figure panels: HED, HEDSafe, PiDiNet, PiDiNetSafe]


Scribble

Scribble, also known as Sketch, is another control type with a hand-drawn effect, but its preprocessed image looks more like a crayon scribble, and it leaves more freedom in controlling the drawing.

Scribble provides 3 preprocessors to choose from: HED, PiDiNet, and XDoG. The figure below shows the effects of the different Scribble preprocessors. HED and PiDiNet are neural network algorithms while XDoG is a classic algorithm, and the contour lines detected by the first two are thicker, which is closer to the hand-drawn look of a scribble.

Depth of field

The outline types above all detect images from a two-dimensional, planar perspective. Is there a control type that can reflect three-dimensional layering?

This brings us to the depth-of-field ControlNet models. Depth of field here refers to the distance between an object in the image and the lens. Simply put, this type of model can reflect the front-to-back relationship between elements. It includes Depth and NormalMap (normal map), two long-standing models.

There are four preprocessors for Depth: LeReS, LeReS++, MiDaS, and ZoE. In the figure below, we can see the detection effects of these four preprocessors.

In comparison, LeReS and LeReS++ extract relatively rich detail in the depth map, with LeReS++ being the better of the two.

MiDaS and ZoE are more suitable for processing complex scenes. ZoE has the largest number of parameters, so it is slower to process. In practice, it tends to strengthen the depth contrast between foreground and background.

The final image also differs depending on the Depth preprocessor algorithm. In the following case, you can see that the MiDaS algorithm restores the depth relationships in the scene very well. In actual use, you can judge from the preprocessed depth maps which one presents the depth relationships most appropriately.
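A depth map is simply the per-pixel distance to the lens encoded as a grayscale image. The following minimal sketch shows that encoding step only. It assumes the distances are already known, whereas the real preprocessors estimate them from a single picture with neural networks, and it assumes the convention that nearer objects are drawn brighter. The function name `depth_to_grayscale` is hypothetical.

```python
def depth_to_grayscale(distances):
    """Map per-pixel distances to 0-255, with the nearest pixel brightest.

    Assumption for this sketch: near = bright, far = dark.
    """
    flat = [d for row in distances for d in row]
    nearest, farthest = min(flat), max(flat)
    span = farthest - nearest
    if span == 0:
        # A completely flat scene has no depth ordering; use mid-gray.
        return [[128 for _ in row] for row in distances]
    return [
        [round(255 * (farthest - d) / span) for d in row]
        for row in distances
    ]


# Toy scene: distances to the lens in meters.
distances = [[1.0, 2.0], [4.0, 5.0]]
depth_map = depth_to_grayscale(distances)
print(depth_map)  # [[255, 191], [64, 0]]
```

Comparing such maps side by side is what "judging from the preprocessed depth map" amounts to: the preprocessor whose grayscale ordering best matches the real front-to-back layout is the better choice for that image.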

Redraw type

Finally we get to the redraw type. Redraw control in ControlNet is more fine-grained. This type of model can be understood as an extension of the native image redrawing function.


Inpaint

Inpaint in ControlNet is equivalent to replacing the algorithm behind the native image redrawing function. When used, it is still affected by parameters such as the redraw strength (the denoising strength setting). As shown in the case below, a lower redraw strength achieves a good conversion of a real-life portrait into a two-dimensional (anime-style) image while more accurately restoring the hairstyle and hair color of the character in the original image.


Tile

Tile is arguably the most powerful model in the entire redraw category. Although it is still experimental, it has brought many surprises to the community, so it is worth introducing here.

In the half year since SD was open-sourced, drawing super-resolution, high-definition large images has been the pursuit of many enthusiasts. However, because of high graphics card prices and computing power bottlenecks, it has been difficult to get satisfactory results by drawing directly through the WebUI.

Later, developers came up with the Tile (block-by-block) drawing method. The principle is to cut the very large image into small tiles that the graphics card can handle, draw them one by one, and then stitch them back into one complete large image. Drawing takes longer, but the approach greatly raises the ceiling of what a graphics card can do and makes it possible to draw high-definition large images on a card with little memory.
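The cutting step of the tile method can be sketched in a few lines. This is a simplified illustration, not the plug-in's implementation: it produces non-overlapping tiles, whereas real tiled workflows often overlap neighbouring tiles to hide seams, which is omitted here. The function name `tile_boxes` and the example sizes are made up for this sketch.

```python
def tile_boxes(width, height, tile_size):
    """Return (left, top, right, bottom) boxes that cover a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    boxes = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            right = min(left + tile_size, width)
            bottom = min(top + tile_size, height)
            boxes.append((left, top, right, bottom))
    return boxes


# A 2048x1536 target image drawn as 512x512 tiles.
boxes = tile_boxes(2048, 1536, 512)
print(len(boxes))  # 12
print(boxes[0])    # (0, 0, 512, 512)
print(boxes[-1])   # (1536, 1024, 2048, 1536)
```

Each box is drawn separately and pasted back at its coordinates, so peak memory depends on the tile size rather than on the size of the final image, at the cost of drawing 12 times instead of once in this example.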

[Figure panels: Original, Tile]

Tile provides three preprocessors: colorfix, colorfix+sharp, and resample, which stand for color fixing, color fixing plus sharpening, and resampling respectively. In comparison, the default resample leaves more room for creativity when drawing, and the content differs more from the original image.


InstructP2P

The full name of InstructP2P is Instruct Pix2Pix; it is a ControlNet trained on the Instruct Pix2Pix dataset. Its function is basically the same as image-to-image generation: it redraws by referring directly to the features of the original image, so it can be used without a separate preprocessor.

In the figure below, for ease of comparison, we reduce the redraw strength to 0.1. With InstructP2P turned on, the result retains more useful details than rendering without it.

[Figure panels: Original, P2P, InstructP2P]

InstructP2P is still experimental and is not a mature ControlNet model. It is not commonly used, so a general understanding of its function is enough.


Shuffle

Shuffle (random shuffling) is a very special type of image control. It effectively shuffles all the features of the reference image at random and then reassembles them. The generated image may differ from the original in structure, content, and so on, but a connection can still be seen in its style.

[Figure panels: Original, Shuffle, Output Image]

Unlike other preprocessors, Shuffle extracts features completely at random, so it is affected by the seed value. With different seeds, the preprocessed images can look wildly different.

Shuffle has few usage scenarios, because its control stability is arguably the worst of all ControlNet models. You can think of it as an alternative way of drawing cards at random, and it is also a good choice for finding inspiration.
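The seed dependence can be illustrated with Python's standard `random` module. This is only an analogy: the real Shuffle preprocessor works on image pixels, not on a list of labelled patches, and the function `shuffle_patches` and the patch labels are invented for this sketch.

```python
import random


def shuffle_patches(patches, seed):
    """Return a new list with the patches reordered; the order depends only on seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    shuffled = list(patches)   # copy, so the input list is not modified
    rng.shuffle(shuffled)
    return shuffled


patches = ["sky", "tree", "face", "road", "house", "river"]
same_a = shuffle_patches(patches, seed=42)
same_b = shuffle_patches(patches, seed=42)
other = shuffle_patches(patches, seed=7)

print(same_a == same_b)  # True: the same seed reproduces the same arrangement
print(other)             # a different seed will generally give a different order
```

Every arrangement contains the same material, only rearranged, which mirrors how Shuffle keeps the style of the reference image while the structure changes from seed to seed.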


Summary

To sum up, before ControlNet emerged, we never knew what the AI would produce until the picture was generated, which was as unpredictable as drawing a card.

With ControlNet, we can control image generation accurately through models, for example by uploading line drawings for the AI to color and render, controlling the posture of characters, or generating line drawings from pictures.

It's amazing. You may have seen pictures generated from line drawings on the Internet; they were made with ControlNet, one of the must-install plug-ins for Stable Diffusion.