Comparing Stable Diffusion Models

Prompting Pixels
7 min read · Apr 2, 2024

Stable Diffusion, an open-source text-to-image model released by Stability AI, has revolutionized the field of generative AI.

Since its initial release in 2022, the model has gone through several major iterations and improvements.

Here’s what you need to know about the major releases:

+----------------+---------------+
| Version number | Release date  |
+----------------+---------------+
| 1.1            | June 2022     |
| 1.2            | June 2022     |
| 1.3            | June 2022     |
| 1.4            | August 2022   |
| 1.5            | October 2022  |
| 2.0            | November 2022 |
| 2.1            | December 2022 |
| XL 1.0         | July 2023     |
| XL Turbo       | November 2023 |
| Cascade        | February 2024 |
| 3.0            | Coming Soon   |
+----------------+---------------+

Stable Diffusion 1.x Models

The first generation of Stable Diffusion models, known as the 1.x series, includes versions 1.1, 1.2, 1.3, 1.4, and 1.5.

These models have a resolution of 512x512 pixels and use a ViT-L/14 CLIP model for text conditioning.

The 1.x models have 860 million parameters.
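If you want to try a 1.x model yourself, here is a minimal sketch using Hugging Face's diffusers library. It assumes a CUDA GPU and the runwayml/stable-diffusion-v1-5 checkpoint the community has standardized on; the prompt is just an example.

```python
import torch
from diffusers import StableDiffusionPipeline

# SD 1.5 weights as published on Hugging Face
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# 1.x models were trained at 512x512, which is also the pipeline default
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("sd15_sample.png")
```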

Sample Outputs

Key Facts

Good use cases for this model: Generating a wide range of styles and subjects with relatively low computational requirements.

Poor use cases for this model: Anything that demands strong prompt comprehension or high resolution. Subjects are often disfigured, and images can look flat.

Fine-Tuned Models

While Stable Diffusion 1.5 provides outputs that, let’s be honest, don’t look all that great, the open-source community has far better models available.

There are thousands of them, tailored to specific use cases including photorealism, cartoons, anime, and more.

For example, DreamShaper, Juggernaut, and RealCartoon are just a few of the models that rely on Stable Diffusion 1.5 as a base yet deliver stunning results.
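Most of these community checkpoints ship as a single .safetensors file (often downloaded from sites like Civitai), which diffusers can load directly. A minimal sketch; the file path below is hypothetical:

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical local path to a community SD 1.5 fine-tune
pipe = StableDiffusionPipeline.from_single_file(
    "./dreamshaper_8.safetensors", torch_dtype=torch.float16
).to("cuda")

image = pipe("portrait photo of an astronaut, 35mm film").images[0]
image.save("finetune_sample.png")
```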

Stable Diffusion 2.x Models

Released in late 2022, the 2.x series includes versions 2.0 and 2.1. These models have an increased resolution of 768x768 pixels and use a different text encoder, OpenCLIP ViT-H/14, which allows prompts to be more expressive.

The different CLIP model used by 2.x made it difficult for people to migrate from 1.x as the prompts really didn’t transfer that well — hampering its widespread use in the open source community.

Per the project's GitHub README, the parameter count is the same as 1.5: 860 million.
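Running 2.1 looks much like 1.5, just at the larger native resolution. A minimal sketch with diffusers, assuming a CUDA GPU; the prompt is illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# The 2.1 768-v checkpoint is trained at 768x768, up from 512x512 in the 1.x series
image = pipe(
    "a gothic cathedral interior, dramatic light", height=768, width=768
).images[0]
image.save("sd21_sample.png")
```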

Sample Outputs

Key Facts

Good use cases for this model: Higher-resolution outputs compared to 1.x models. Improved handling of complex and expressive prompts. Better performance on architectural and landscape subjects than on people. Nice dynamic range of colors.

Poor use cases for this model: Generations are more restricted than in 1.x. The filtered training data effectively censors celebrity likenesses and many art styles.

Fine-Tuned Models

Stable Diffusion 2.0 and 2.1 were not as widely adopted by the open-source community as 1.5; however, some fine-tuned models do exist.

Stable Diffusion XL 1.0

Released in 2023, SDXL 1.0 delivers Midjourney- and DALL·E-level outputs that can run on consumer-level hardware. With a resolution of 1024x1024 pixels and two text encoders (OpenCLIP-ViT/G and CLIP-ViT/L) for conditioning, SDXL makes it much easier to get the results you may be after.

Per Stability AI's initial release, SDXL 1.0 has a 3.5B-parameter base model and a 6.6B-parameter model ensemble pipeline that pairs the base model with a refiner.
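Below is a minimal sketch of that ensemble handoff using diffusers: the base model runs most of the denoising steps, then passes its latents to the refiner to finish. The prompt and the 80/20 split are illustrative, not prescriptive.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

prompt = "a studio photo of a vintage motorcycle"  # example prompt

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# The base model handles the first ~80% of denoising and hands off raw latents
latents = base(prompt=prompt, denoising_end=0.8, output_type="latent").images
# The refiner takes over for the final ~20%, sharpening fine detail
image = refiner(prompt=prompt, denoising_start=0.8, image=latents).images[0]
image.save("sdxl_sample.png")
```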

Sample Outputs

Key Facts

Good use cases for this model: Highest resolution outputs among Stable Diffusion models. Improved color depth, composition, and overall image quality. Better understanding of complex prompts and concepts.

Poor use cases for this model: Requires significant computational resources to run locally, so lower-end consumer hardware may struggle. Things like hands still aren't quite right.

Fine-Tuned Models

The open-source community has embraced SDXL and has released several fine-tuned models that produce quality outputs with SDXL.

Juggernaut XL, DreamShaper XL, RealVisXL, and Animagine XL are among the most popular and can serve various use cases:

SDXL Turbo

SDXL Turbo is a distilled version of SDXL 1.0, designed for rapid generation of 512x512 pixel images. It uses the same text encoders as SDXL 1.0 and has 3.5 billion parameters, yet it can produce images in just a single step.
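A minimal sketch of single-step generation with diffusers; note that Turbo is run with guidance disabled (guidance_scale=0.0), per Stability AI's usage notes, and the prompt is just an example:

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# One denoising step, no classifier-free guidance
image = pipe(
    "a cinematic photo of a fox in the snow",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("turbo_sample.png")
```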

Sample Outputs

Key Facts

Good use cases for this model: Provides good outputs in very little time. Prototyping applications and workflows. Real-time experiments.

Poor use cases for this model: The non-commercial license limits it to personal and research use.

Fine-Tuned Models

Like 2.1, the ecosystem of open-source models for SDXL Turbo is limited. While models exist, most creators have put their efforts towards more popular base models including SDXL and SD 1.5.

Stable Cascade

Stable Cascade is a unique model that uses the Würstchen architecture, which allows for more efficient training and inference. It works in three stages (C, B, and A) and has a compression factor of 42.

Stage C comes in 1-billion and 3.6-billion parameter versions, and Stage B in 700-million and 1.5-billion parameter versions; the variants are interchangeable, so you can mix and match depending on your hardware requirements and/or limitations.
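The staging shows up directly in the code: diffusers splits Stable Cascade into a prior pipeline (Stage C) and a decoder pipeline (Stages B and A). A minimal sketch, assuming a CUDA GPU with enough VRAM and an example prompt:

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prompt = "an origami crane on a wooden desk"  # example prompt

# Stage C: generate highly compressed latents from the text prompt
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16
).to("cuda")
prior_output = prior(prompt=prompt, num_inference_steps=20)

# Stages B and A: expand the latents and decode them into pixels
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16
).to("cuda")
image = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    guidance_scale=0.0,
    num_inference_steps=10,
).images[0]
image.save("cascade_sample.png")
```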

Like SDXL Turbo, Stable Cascade is a research-only model.

Sample Outputs

Key Facts

Good use cases for this model: Provides SDXL-quality outputs with better prompt comprehension. May provide quicker outputs depending on the model variants used. Fine details like hands and teeth render more reliably.

Poor use cases for this model: Requires significant VRAM to load the model. Wide support by the open-source community is yet to be seen.

Fine-Tuned Models

Currently, very few fine-tuned models exist for Stable Cascade.

Stable Diffusion 3.0

The latest addition to the Stable Diffusion family, announced in February 2024, is Stable Diffusion 3.0. While details are scarce, initial results show significant improvements in prompt alignment and overall image quality.

We’ll update this section if/when Stable Diffusion 3 is released.
