Latent Diffusion Explained Simply (with Pokémon)

From Text to Image, Image to Image and Inpainting — the Latent Diffusion revolution.

Published in

Towards AI

8 min readSep 16, 2022

Latent diffusion has been at the center of attention for the past few months, with people generating all sorts of images from text prompts.

Image generated using stable diffusion with the prompt “a photograph of an astronaut riding a horse”

Looking at the high quality makes you wonder what this technology could be used for in the future. And I’m not just talking about images. Imagine a 3D world procedurally built as a player explores a region in a game, designing novel enzymes for, say, digesting plastic purely based on a description, or even generating hypotheses/questions/corrections given a scientific paper (every Ph.D. student’s dream).

With this article, I want to dig deep into these models and (attempt to) explain how they work (with Pokémon). By the end, you should gain an intuitive understanding of what happens after typing your text prompt. I structured it a bit differently than I normally would: first, the topic is introduced, followed by some visual results and intuitions, and finally, a more in-depth explanation. Just as the famous quote goes:

“If you can’t explain it simply, you don’t understand it well enough”

Why Pokémon? Because you are probably familiar with the games and there’s a chance you have enjoyed playing them. Also, it is relatively niche, so it should push the model out of its “comfort zone” (i.e., less populated parts of the latent space ).

If you’d like to play around with the model, try out the demo here:

Stable Diffusion - a Hugging Face Space by stabilityai

Discover amazing ML apps made by the community

huggingface.co

Alternatively, you can browse a collection of generated images using the search engine Lexica (https://lexica.art)

Let’s dive into some use cases!

Text to Image

Text to Image is probably the most famous application of latent diffusion models (at least according to my Twitter feed). Much like a conditional-VAE, we use a text prompt (y) to generate conditionally from the distribution p(z|y), thus serving as a guide to an area of the latent space.

I have experimented using “Pikachu” which I figured the model would have seen pictures of. Additionally, I have inserted the location prompts to be about The Starry Night by Vincent van Gogh and the Canary Islands. The former should push the model to draw a 2D Pikachu to fit the art style, while the latter, being a real-world location, could result in a 3D Pikachu.

Results

Prompt: “a pikachu surfing in the starry night painting”

This is quite pretty — even considering the occasional deformities of the Pikachu. The first picture is probably my favorite as it even picked up the church-like building and added a field.

Prompt: “a pikachu surfing in the Canary Islands”

Definitely more successful in getting the surfboard and the water. An interesting detail is the dunes of sand in some of the images which you can find in the Canary Islands:

Sand Dunes in Gran Canaria — Photo by Bryan Li https://bryanli.io

How it works

The input y is pre-processed by a domain-specific encoder (tau_theta) to produce a representation tau_theta(y).

Conditioning step encoded by tau_theta to condition the generation process. Image from https://arxiv.org/abs/2112.10752

The underlying UNet architecture is implemented with cross-attention layers, meaning that at each unit of the UNet, attention is calculated and then passed to later layers.

Attention is usually expressed with the Query, Key, Value terminology. Here, these values will be modified by taking into account the output of the domain-specific encoder tau_theta(y) as such:

Meaning we are encoding our prompt to condition the de-noising step and therefore output an image that would fit the description.

Image to Image

This task is quite similar to the Text To Image. The input, however, is an image that serves as a prompt. The image is then encoded using the domain-specific encoder tau_theta(y) and fed into the model just like text. In addition to the image, you can also input a text prompt to guide the generation process further.

Results

My inputs to the models were:

“A fire-type Pokémon spitting blue fire.”
This doodle (I know, it looks terrible):

The results surprised me:

Generated images with settings –strength 0.8 –n_iter 30 and prompt “A fire type pokemon spitting blue fire”

Most of these look Pokémon-like, and several of them integrate the blue as fire.

Unfortunately, none of them used the tophat as a tophat (we can blame my drawing skills for this), but most of them used the black color on the head.

Inpainting

Inpainting is the practice of removing or replacing objects in an image based on a mask. I remember back in 2010 (or potentially earlier) when Photoshop released it as a feature as part of their “Healing Brush,” and I was just astonished.

Latent diffusion models, you guessed it, can be used for this too! HuggingFace has a great intuitive User Interface where you can paint the mask on the image itself:

InPainting Stable Diffusion CPU - a Hugging Face Space by fffiloni

Discover amazing ML apps made by the community

huggingface.co

Results

Input the image and mask on the left. The generated image on the right.

The hat looks somewhat like a squashed version of the Harry Potter sorting hat — in the AI’s defense, we did give it quite a small mask. I also like that it adapted it to the style of the video game.

There are a few other elements to notice:

There was an attempt at fixing the flooring around the hat. It’s not amazing, but it is a start.
Dawn’s eyes appear darker and almost empty in the generated image. Her mouth also changed shape slightly. I believe this is due to the loss of information in the encoding and decoding step

Finally, I also tried to create a new character:

It has a backpack and a weirdly-shaped hat and legs. The “blue” prompt was taken up for the trousers and the hat partially.

How it works

Very similar to Image To Image, in the Inpainting process, we modify the masked area (“unknown region”) while keeping the rest of the image (“known region”) fixed. As with other models, we can input a text prompt which will be encoded and passed on together with the known region of the image. The model will then fill in the unknown region with an image that fits the shape of the region AND the text prompt.

Earlier, I mentioned how the latent representation offers advantages in computation and therefore speed. For Inpainting, latent diffusion models are at least 2.7x faster with a 1.6x increase in Fréchet Inception Distance (FID). This metric quantified the difference in distribution (mean and standard deviation) between real images and generated images rather than pixel-by-pixel metrics. The idea is that it tries to imitate the “perception” of similarity between images.

Other Uses

The paper presents other ways latent diffusion models could be used: image super-resolution and layout-to-image synthesis.

Image super-resolution is the process of generating a high-resolution image from a low-resolution one. It is particularly useful in biomedical applications like MRI scans. I doubt the model could be used effectively in such tasks without any re-training. However, with normal images, the results are visually pleasing:

Image from the original paper https://arxiv.org/abs/2112.10752

Layout-to-image synthesis aims at generating images that fit a bounding box which guides the generation of an element in a specific position of the image. This is an example:

Limitations & Ethical Considerations

One very intuitive limitation is that the model cannot generate images without any prior knowledge about a topic. For instance, without knowing what The Starry Night looks like, it would be impossible to generate images related to it. It would be interesting whether similar-looking images could be generated purely based on a text description and without any painting from van Gogh.
Potential misuse of the AI model can have a significant societal impact, for example, creating fake images related to spam/news or offensive images. The original repository has several NSFW filters which hide the offensive images and also place an invisible watermark on the image, although it is not impossible to bypass it given the open-source nature of the model.
These models are particularly susceptible to language bias and may display it in the generation of the images. Therefore, it is possible for the AI to misunderstand the prompt and generate potentially unrelated images.
This model has been used to generate art and has even won art competitions (without the judges knowing it was AI-generated). Should this be allowed? Also, would you be interested in an art competition using (only) latent diffusion models?

An A.I. Beat Human Artists in a Competition. Will It Come for Their Jobs Next?

Listen to What Next: TBD: Last month, a piece of art called Théâtre D'opéra Spatial (that's French for "Space Opera…

slate.com

☕️ Support My Work!

I’m Dr. Leucine, sharing insights on 🧠 knowledge management with Obsidian and 🍥 protein design. If you find my content helpful, consider buying me a coffee to keep the ideas flowing!

👉 Buy me a coffee

Conclusion

In this article, I covered three uses of latent diffusion models: Text To Image, Image to Image, and Inpainting. I explored (in the context of Pokémon) how each of these could be used to generate images and peculiar details for each generated image.

There are other ways these models can be used, for example for image super-resolution and layout-to-image synthesis. If you would like a more mathematical explanation of some of the parts of this blog post, I highly recommend the original paper:

High-Resolution Image Synthesis with Latent Diffusion Models

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models…

arxiv.org

or this blog post:

What are Diffusion Models?

Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of…

lilianweng.github.io

Disclaimer: all the generated images in this post have been created by stable diffusion (and me) using the official GitHub code:

GitHub - CompVis/stable-diffusion: A latent text-to-image diffusion model

Stable Diffusion was made possible thanks to a collaboration with Stability AI and Runway and builds upon our previous…

github.com

If you do find any mistakes or inaccuracies, please feel free to comment/contact me, and I will update the article (and include you in the special thanks section).

Towards AI

Latent Diffusion Explained Simply (with Pokémon)

From Text to Image, Image to Image and Inpainting — the Latent Diffusion revolution.

Stable Diffusion - a Hugging Face Space by stabilityai

Discover amazing ML apps made by the community

Text to Image

Results

How it works

Image to Image

Results

Inpainting

InPainting Stable Diffusion CPU - a Hugging Face Space by fffiloni

Discover amazing ML apps made by the community

Results

How it works

Other Uses

Limitations & Ethical Considerations

An A.I. Beat Human Artists in a Competition. Will It Come for Their Jobs Next?

Listen to What Next: TBD: Last month, a piece of art called Théâtre D'opéra Spatial (that's French for "Space Opera…

☕️ Support My Work!

Conclusion

High-Resolution Image Synthesis with Latent Diffusion Models

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models…

What are Diffusion Models?

Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of…

GitHub - CompVis/stable-diffusion: A latent text-to-image diffusion model

Stable Diffusion was made possible thanks to a collaboration with Stability AI and Runway and builds upon our previous…

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Published in Towards AI

Written by Leonardo Castorina

Responses (1)