Stable Diffusion: The Science Behind AI Art

This article explores Stable Diffusion, the AI modeling technique that enables AI art.
By Boris Delovski • Updated on Mar 26, 2024

Diffusion models have emerged as a fundamental component of generative AI, particularly within the realm of Computer Vision. They hold a significant advantage over simpler models, excelling in both the diversity and quality of the images they generate. In this article, our focus is on Stable Diffusion, a diffusion model that generates images from text prompts, image prompts, or a combination of both. These prompts specify the desired output, making interaction with the model highly intuitive and accessible to everyone.

What Is Stable Diffusion

Stable Diffusion is a diffusion model first released in 2022. It is primarily used to generate new, original images from text prompts, image prompts, or a combination of both. However, it can also be used for other tasks such as:

  • image upscaling - increasing the resolution of the original image
  • inpainting - restoring damaged or missing parts of an image and adding objects to it
  • outpainting - extending an image beyond its original canvas

Stable Diffusion was developed through the combined efforts of one university and two well-known generative AI companies:

  • CompVis Group at Ludwig Maximilian University of Munich
  • Runway AI
  • Stability AI

Stable Diffusion was indeed developed through the combined efforts of these three organizations. However, most of the credit should be attributed to Stability AI, which funded and shaped the development of the model, raising over 100 million USD for its research and development. Just as importantly, Stability AI made the model open source. Unlike its main competitors, Midjourney (developed by the company of the same name) and DALL-E (developed by OpenAI), Stable Diffusion gives all users full access to the model. Users are permitted to use, modify, and redistribute it, on the condition that they include the original copyright notice in any distribution of the software.

This openness brings a great number of benefits. The most notable one is that users can install the model on their own hardware and experiment with it, making modifications as they see fit. This allows them to adjust the model to best serve their purposes, and it positions Stable Diffusion as arguably the most flexible generative model available.

How Do Stable Diffusion Models Work

The architecture of Stable Diffusion differs from standard diffusion models in two primary aspects. It takes advantage of the following:

  • Latent Space Optimization
  • Text Conditioning Mechanism

These two modifications are integrated throughout the three main parts of the model architecture. Those three parts are:

  • Text Encoder
  • U-Net model together with a scheduling algorithm
  • Variational Autoencoder (VAE)

The process of generating images from text involves a complex flow of information through the various components of the aforementioned architecture. Let us cover the process step-by-step.


What Is the Text Encoder

The Text Encoder's primary role is to translate natural language inputs into a numerical format that the model can comprehend and analyze. This encoding captures the semantic essence of the text, which is crucial for ensuring that the generated images accurately reflect the input descriptions. 

Whether the encoder is trained from scratch or used as a pre-trained component depends on the specific implementation and design choices of the model. The most common approach is to use a pre-trained encoder; the encoder is rarely trained from scratch along with the rest of the model. To elaborate, we use a large pre-trained language model that has been trained on an extensive corpus of text data. This model comprehends and encodes textual information into high-dimensional vectors that effectively represent the semantic and syntactic information contained in the text.

For Stable Diffusion, the typical approach involves using a pre-trained text encoder such as OpenAI's CLIP (Contrastive Language-Image Pre-training) model or a similar model. These encoders are specifically engineered to comprehend the relationship between text and images, rendering them well-suited for text-to-image tasks. The CLIP model, for example, was trained on a large collection of text and image pairs, which enables it to generate meaningful embeddings that effectively bridge the textual and visual domains.
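
To make this more concrete, here is a minimal sketch of how a prompt can be turned into embeddings with the transformers library. It assumes the openai/clip-vit-large-patch14 checkpoint, which is the text encoder used by the Stable Diffusion v1.x models; the prompt is just an example.

# A minimal sketch: encoding a prompt with the CLIP text encoder
# (assumes the "openai/clip-vit-large-patch14" checkpoint used by Stable Diffusion v1.x)
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the prompt and pad it to the fixed length the model expects
tokens = tokenizer(
    "beautiful sunset",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)

# The last hidden states are the text embeddings that will condition the U-Net
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)  # torch.Size([1, 77, 768]) for this checkpoint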

What Is the Variational Autoencoder - Encoder Part

The VAE model consists of two parts, an encoder and a decoder. The decoder will be explored later on; at this point, the focus falls on the encoder. Stable Diffusion does not work directly with the pixels of the image the way other diffusion models do. Instead, it uses a method called latent space optimization. Essentially, the first thing we do is encode images into a lower-dimensional latent space using the encoder part of an autoencoder. This reduces the overall size of our working data. In other words, it reduces the size of the array we operate on, which in turn renders the entire process much more computationally efficient.
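
As a rough illustration of that size reduction, the sketch below encodes a dummy image into latents with the VAE from the diffusers library. The checkpoint name and the random "image" are assumptions made purely for demonstration.

# A rough sketch: compressing an image into latent space with the VAE encoder
# (assumes the VAE bundled with the "runwayml/stable-diffusion-v1-5" checkpoint)
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# A dummy 512 x 512 RGB image in the [-1, 1] range the VAE expects
image = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    # Encode to the latent distribution, sample from it, and apply the scaling factor
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

print(image.shape)    # torch.Size([1, 3, 512, 512]) - 786,432 values
print(latents.shape)  # torch.Size([1, 4, 64, 64])   - 16,384 values, 48x smaller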

What Is U-Net and the Scheduling Algorithm

The critical component of the model, and where the diffusion occurs, is the utilization of the U-Net model. U-Net is a Convolutional Neural Network (CNN) that was originally developed to perform image segmentation on medical images. It is a so-called "Fully Convolutional Network", which means that it consists purely of convolutional layers and does not contain any fully connected layers.

In the context of diffusion models such as Stable Diffusion, the U-Net is employed to systematically diffuse information across multiple steps. To be more precise, when training the model, we first train the U-Net like a typical diffusion model. During training, a fixed forward process gradually adds noise to the images (or rather, to their latent representations) until they resemble pure Gaussian noise.

The U-Net is then trained to predict and remove that noise, restoring the images to their original form. When performing inference with our Stable Diffusion model, the U-Net is used strictly for denoising. Essentially, through multiple steps, it gradually eliminates noise from the initial array (which is pure noise) until we obtain an image that closely resembles the ones used to train the model. This whole iterative process is controlled by the scheduler.
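
To make the training objective more concrete, here is a simplified sketch of a single noise-prediction step. It assumes a loaded U-Net (unet, as loaded in the inference sketch further below), clean image latents (latents, as produced by the VAE encoder sketch above), and the text_embeddings from the text encoder sketch; the scheduler configuration is an arbitrary choice for illustration.

# A simplified sketch of one U-Net training step (noise prediction)
# (assumes `unet`, clean `latents`, and `text_embeddings` from the surrounding snippets)
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

# Forward process: corrupt the clean latents with Gaussian noise at a random timestep
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# Reverse process: the U-Net learns to predict the noise that was added
noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
loss = F.mse_loss(noise_pred, noise)
loss.backward()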

The information provided to the model during inference is a combination of the text embeddings we created using the Text Encoder and a latent array. From these, the U-Net model produces a new array in the latent space which, in theory, should match the input text more closely than the starting array.

Such a process happens over multiple steps. During each step, the model operates on a starting array in the latent space and transforms it into an output array, with each output array progressively resembling the desired outcome more and more. The output array from each step serves as the input for the subsequent step of the U-Net. Finally, once we have the final output latent, we send it to the decoder.
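
The pipeline we will use later hides all of this behind a single call, but a condensed sketch of the loop looks roughly like this. It assumes the U-Net and scheduler from the runwayml/stable-diffusion-v1-5 checkpoint, the text_embeddings produced in the text encoder sketch, and 50 denoising steps chosen arbitrarily.

# A condensed sketch of the denoising loop that the pipeline runs internally
# (assumes the "runwayml/stable-diffusion-v1-5" checkpoint and the
# `text_embeddings` produced by the text encoder in the earlier snippet)
import torch
from diffusers import UNet2DConditionModel, PNDMScheduler

model_id = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Start from pure Gaussian noise in latent space (4 x 64 x 64 for a 512 x 512 image)
latents = torch.randn(1, unet.config.in_channels, 64, 64) * scheduler.init_noise_sigma

scheduler.set_timesteps(50)
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        # Predict the noise present in the current latents, conditioned on the text
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    # The scheduler removes a portion of the predicted noise for this step
    latents = scheduler.step(noise_pred, t, latents).prev_sample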

What Is the Variational Autoencoder - Decoder Part

The last part of the Stable Diffusion architecture is the decoder of the autoencoder. The encoder was already used to map images into the lower-dimensional latent space. Now it is time to utilize the decoder component of the autoencoder. Its role is to translate the output latent array, obtained after multiple steps in the U-Net section of the architecture, back into pixel space.
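
Continuing the earlier sketches, decoding the final latents back into a viewable image can look roughly like this; it assumes the latents produced by the denoising loop above, the vae loaded earlier, and an arbitrary output file name.

# Continuing the sketch: decode the final latents back into pixel space
# (assumes the `latents` from the denoising loop and the `vae` loaded earlier)
import torch
from PIL import Image

with torch.no_grad():
    decoded = vae.decode(latents / vae.config.scaling_factor).sample

# Rescale from [-1, 1] to [0, 255] and convert the tensor into a PIL image
decoded = (decoded / 2 + 0.5).clamp(0, 1)
array = (decoded[0].permute(1, 2, 0).numpy() * 255).round().astype("uint8")
Image.fromarray(array).save("decoded_image.png")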

How to Use Stable Diffusion Models from Hugging Face

Building Stable Diffusion from scratch can be quite complicated. It is a good idea only for those with solid experience in Deep Learning and a strong foundation in programming. Beginners are better off using one of the prebuilt models from Hugging Face.

Hugging Face is a company distinguished for its success in the field of AI. It provides a platform for sharing and discovering pre-trained models and datasets, facilitating collaboration within the AI community. This platform, known as the Hugging Face Hub, allows users to upload, share, and use models and datasets across a wide range of tasks. In the beginning, the focus was mostly on NLP tasks; nowadays, however, you can also find pre-trained Stable Diffusion models on Hugging Face.

To use these models, we first need to install the diffusers library. Through this library, we can access state-of-the-art pre-trained diffusion models for generating images. The principal reason people prefer this library is that it is especially easy to use: we can start generating images with only a few lines of code. In this article, we will not cover training models, because that takes an abundance of time and computational resources. Instead, we will focus on demonstrating how to conduct inference, which involves using a pre-trained model to generate images based on text prompts.

Firstly, let us install the library. The Stable Diffusion pipeline also relies on PyTorch and the transformers library, so it is easiest to install them together. To do so, you can use pip in your environment with the following command:

pip install diffusers transformers torch

After installing the library, go ahead and use one of the prebuilt models to generate images. The first thing we will do is import the StableDiffusionPipeline class from the diffusers library:

# Import what we need
from diffusers import StableDiffusionPipeline      

This class enables us to create a text-to-image generation pipeline using Stable Diffusion. The next thing we will do is load a pre-trained Stable Diffusion model. You can choose from the variety of models offered on Hugging Face; however, for the sake of demonstration, let us use the most popular Stable Diffusion model. All you have to do here is store the name of the model in a variable as a string:

# Define which model we will use to generate images
model_id = "runwayml/stable-diffusion-v1-5"

Finally, let us create a pipeline using the pre-trained model and the class we imported earlier. To do so, we use the from_pretrained method of the StableDiffusionPipeline class:

# Create a pipeline
pipeline = StableDiffusionPipeline.from_pretrained(model_id)  

If you have access to a GPU, it is advisable to transfer the pipeline to the GPU to ensure that inference, namely creating images, is as fast as possible. To do so, we use the following code:

# Move the pipeline to the GPU
pipeline = pipeline.to("cuda")

Finally, it is time to generate an image. To do so, we use the pipeline we defined earlier: we only need to input a prompt and we get an image back.

# Generate an image
image = pipeline("beautiful sunset").images[0]    

This will generate an image and store it in the image variable. Keep in mind that you will most likely get an image that is slightly different from mine; the process starts from random noise, so each run produces a different result. In my case, the model generated the following image:

Stable Diffusion Generated Image
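
If you want reproducible results, you can pass a seeded random number generator to the pipeline and save the output to disk. Here is a small sketch, assuming the pipeline has been moved to the GPU as above (use torch.Generator("cpu") otherwise); the seed value and file name are arbitrary.

# Optional: fix the random seed for reproducible results and save the image
# (the seed value and file name here are arbitrary choices)
import torch

generator = torch.Generator("cuda").manual_seed(42)
image = pipeline("beautiful sunset", generator=generator).images[0]
image.save("sunset.png")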

Certainly, if desired, one could engage in prompt engineering to specify more precisely what should appear in the image. However, prompt engineering is a complex topic deserving of its own dedicated article, so I will not delve into it at this moment.

In this article, we covered how Stable Diffusion, one of the most popular models for generating images, works. First, we explored what Stable Diffusion is and how it works in detail. Later, we demonstrated how to use a pre-trained Stable Diffusion model from the Hugging Face platform to generate images using just a few lines of code. Because the model can be prompted so intuitively, even individuals with no prior AI experience can engage with it effectively.

Boris Delovski

Data Science Trainer

Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.