🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
Go to file
Nathan Raw 627ad6e8ea
Rename frame filename in interpolation community example (#881)
🎨 rename frame filename
2022-10-17 20:08:58 +02:00
.github [CI] Speed up slow tests (#708) 2022-10-03 22:16:23 +02:00
docs/source Update img2img.mdx 2022-10-12 00:52:30 +02:00
examples Rename frame filename in interpolation community example (#881) 2022-10-17 20:08:58 +02:00
scripts Fix ONNX conversion script opset argument type (#739) 2022-10-07 15:47:43 +02:00
src/diffusers Fix small community pipeline import bug and finish README (#869) 2022-10-17 15:07:48 +02:00
tests Remove the last of ["sample"] (#842) 2022-10-14 14:45:43 +02:00
utils [Dummy imports] Better error message (#795) 2022-10-12 14:41:16 +02:00
.gitignore upload some cleaning tools 2022-05-31 10:17:19 +02:00
CODE_OF_CONDUCT.md [Docs] Add some guides (#276) 2022-08-30 19:13:07 +02:00
CONTRIBUTING.md [Docs] Add some guides (#276) 2022-08-30 19:13:07 +02:00
LICENSE init upload 2022-05-30 18:21:15 +02:00
MANIFEST.in Minor package fixes (#809) 2022-10-12 13:22:51 +02:00
Makefile Style the `scripts` directory (#250) 2022-08-25 15:46:09 +02:00
README.md Minor package fixes (#809) 2022-10-12 13:22:51 +02:00
_typos.toml Fix typos (#568) 2022-09-19 21:58:41 +02:00
pyproject.toml init upload 2022-05-30 18:21:15 +02:00
setup.cfg Fix flake8 F401 imported but unused (#317) 2022-09-01 14:56:25 +02:00
setup.py Bump to 0.6.0.dev0 (#831) 2022-10-14 13:43:52 +02:00

README.md



GitHub GitHub release Contributor Covenant

🤗 Diffusers provides pretrained diffusion models across multiple modalities, such as vision and audio, and serves as a modular toolbox for inference and training of diffusion models.

More precisely, 🤗 Diffusers offers:

  • State-of-the-art diffusion pipelines that can be run in inference with just a couple of lines of code (see src/diffusers/pipelines). Check this overview to see all supported pipelines and their corresponding official papers.
  • Various noise schedulers that can be used interchangeably for the preferred speed vs. quality trade-off in inference (see src/diffusers/schedulers).
  • Multiple types of models, such as UNet, can be used as building blocks in an end-to-end diffusion system (see src/diffusers/models).
  • Training examples to show how to train the most popular diffusion model tasks (see examples, e.g. unconditional-image-generation).

Installation

With pip

pip install --upgrade diffusers

With conda

conda install -c conda-forge diffusers

Apple Silicon (M1/M2) support

Please, refer to the documentation.

Contributing

We ❤️ contributions from the open-source community! If you want to contribute to this library, please check out our Contribution guide. You can look out for issues you'd like to tackle to contribute to the library.

Also, say 👋 in our public Discord channel Join us on Discord. We discuss the hottest trends about diffusion models, help each other with contributions, personal projects or just hang out .

Quickstart

In order to get started, we recommend taking a look at two notebooks:

  • The Getting started with Diffusers Open In Colab notebook, which showcases an end-to-end example of usage for diffusion models, schedulers and pipelines. Take a look at this notebook to learn how to use the pipeline abstraction, which takes care of everything (model, scheduler, noise handling) for you, and also to understand each independent building block in the library.
  • The Training a diffusers model Open In Colab notebook summarizes diffusion models training methods. This notebook takes a step-by-step approach to training your diffusion models on an image dataset, with explanatory graphics.

New Stable Diffusion is now fully compatible with diffusers!

Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It's trained on 512x512 images from a subset of the LAION-5B database. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB VRAM. See the model card for more information.

You need to accept the model license before downloading or using the Stable Diffusion weights. Please, visit the model card, read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to this section of the documentation.

Text-to-Image generation with Stable Diffusion

We recommend using the model in half-precision (fp16) as it gives almost always the same results as full precision while being roughly twice as fast and requiring half the amount of GPU RAM.

# make sure you're logged in with `huggingface-cli login`
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_type=torch.float16, revision="fp16")
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  

Note: If you don't want to use the token, you can also simply download the model weights (after having accepted the license) and pass the path to the local folder to the StableDiffusionPipeline.

git lfs install
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4

Assuming the folder is stored locally under ./stable-diffusion-v1-4, you can also run stable diffusion without requiring an authentication token:

pipe = StableDiffusionPipeline.from_pretrained("./stable-diffusion-v1-4")
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  

If you are limited by GPU memory, you might want to consider chunking the attention computation in addition to using fp16. The following snippet should result in less than 4GB VRAM.

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", 
    revision="fp16", 
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
image = pipe(prompt).images[0]  

If you wish to use a different scheduler, you can simply instantiate it before the pipeline and pass it to from_pretrained.

from diffusers import LMSDiscreteScheduler

lms = LMSDiscreteScheduler(
    beta_start=0.00085, 
    beta_end=0.012, 
    beta_schedule="scaled_linear"
)

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", 
    revision="fp16", 
    torch_dtype=torch.float16,
    scheduler=lms,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  
    
image.save("astronaut_rides_horse.png")

If you want to run Stable Diffusion on CPU or you want to have maximum precision on GPU, please run the model in the default full-precision setting:

# make sure you're logged in with `huggingface-cli login`
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# disable the following line if you run on CPU
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  
    
image.save("astronaut_rides_horse.png")

Image-to-Image text-guided generation with Stable Diffusion

The StableDiffusionImg2ImgPipeline lets you pass a text prompt and an initial image to condition the generation of new images.

import requests
import torch
from PIL import Image
from io import BytesIO

from diffusers import StableDiffusionImg2ImgPipeline

# load the pipeline
device = "cuda"
model_id_or_path = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id_or_path,
    revision="fp16", 
    torch_dtype=torch.float16,
)
# or download via git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
# and pass `model_id_or_path="./stable-diffusion-v1-4"`.
pipe = pipe.to(device)

# let's download an initial image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

prompt = "A fantasy landscape, trending on artstation"

images = pipe(prompt=prompt, init_image=init_image, strength=0.75, guidance_scale=7.5).images

images[0].save("fantasy_landscape.png")

You can also run this example on colab Open In Colab

In-painting using Stable Diffusion

The StableDiffusionInpaintPipeline lets you edit specific parts of an image by providing a mask and text prompt.

from io import BytesIO

import torch
import requests
import PIL

from diffusers import StableDiffusionInpaintPipeline

def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

device = "cuda"
model_id_or_path = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    model_id_or_path,
    revision="fp16", 
    torch_dtype=torch.float16,
)
# or download via git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
# and pass `model_id_or_path="./stable-diffusion-v1-4"`.
pipe = pipe.to(device)

prompt = "a cat sitting on a bench"
images = pipe(prompt=prompt, init_image=init_image, mask_image=mask_image, strength=0.75).images

images[0].save("cat_on_bench.png")

Tweak prompts reusing seeds and latents

You can generate your own latents to reproduce results, or tweak your prompt on a specific result you liked. This notebook shows how to do it step by step. You can also run it in Google Colab Open In Colab.

For more details, check out the Stable Diffusion notebook Open In Colab and have a look into the release notes.

Examples

There are many ways to try running Diffusers! Here we outline code-focused tools (primarily using DiffusionPipelines and Google Colab) and interactive web-tools.

Running Code

If you want to run the code yourself 💻, you can try out:

# !pip install diffusers transformers
from diffusers import DiffusionPipeline

device = "cuda"
model_id = "CompVis/ldm-text2im-large-256"

# load model and scheduler
ldm = DiffusionPipeline.from_pretrained(model_id)
ldm = ldm.to(device)

# run pipeline in inference (sample random noise and denoise)
prompt = "A painting of a squirrel eating a burger"
image = ldm([prompt], num_inference_steps=50, eta=0.3, guidance_scale=6).images[0]

# save image
image.save("squirrel.png")
# !pip install diffusers
from diffusers import DDPMPipeline, DDIMPipeline, PNDMPipeline

model_id = "google/ddpm-celebahq-256"
device = "cuda"

# load model and scheduler
ddpm = DDPMPipeline.from_pretrained(model_id)  # you can replace DDPMPipeline with DDIMPipeline or PNDMPipeline for faster inference
ddpm.to(device)

# run pipeline in inference (sample random noise and denoise)
image = ddpm().images[0]

# save image
image.save("ddpm_generated_image.png")

Other Notebooks:

Web Demos

If you just want to play around with some web demos, you can try out the following 🚀 Spaces:

Model Hugging Face Spaces
Text-to-Image Latent Diffusion Hugging Face Spaces
Faces generator Hugging Face Spaces
DDPM with different schedulers Hugging Face Spaces
Conditional generation from sketch Hugging Face Spaces
Composable diffusion Hugging Face Spaces

Definitions

Models: Neural network that models p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) (see image below) and is trained end-to-end to denoise a noisy input to an image. Examples: UNet, Conditioned UNet, 3D UNet, Transformer UNet


Figure from DDPM paper (https://arxiv.org/abs/2006.11239).

Schedulers: Algorithm class for both inference and training. The class provides functionality to compute previous image according to alpha, beta schedule as well as predict noise for training. Examples: DDPM, DDIM, PNDM, DEIS


Sampling and training algorithms. Figure from DDPM paper (https://arxiv.org/abs/2006.11239).

Diffusion Pipeline: End-to-end pipeline that includes multiple diffusion models, possible text encoders, ... Examples: Glide, Latent-Diffusion, Imagen, DALL-E 2


Figure from ImageGen (https://imagen.research.google/).

Philosophy

  • Readability and clarity is preferred over highly optimized code. A strong importance is put on providing readable, intuitive and elementary code design. E.g., the provided schedulers are separated from the provided models and provide well-commented code that can be read alongside the original paper.
  • Diffusers is modality independent and focuses on providing pretrained models and tools to build systems that generate continuous outputs, e.g. vision and audio.
  • Diffusion models and schedulers are provided as concise, elementary building blocks. In contrast, diffusion pipelines are a collection of end-to-end diffusion systems that can be used out-of-the-box, should stay as close as possible to their original implementation and can include components of another library, such as text-encoders. Examples for diffusion pipelines are Glide and Latent Diffusion.

In the works

For the first release, 🤗 Diffusers focuses on text-to-image diffusion techniques. However, diffusers can be used for much more than that! Over the upcoming releases, we'll be focusing on:

A few pipeline components are already being worked on, namely:

  • BDDMPipeline for spectrogram-to-sound vocoding
  • GLIDEPipeline to support OpenAI's GLIDE model
  • Grad-TTS for text to audio generation / conditional audio generation

We want diffusers to be a toolbox useful for diffusers models in general; if you find yourself limited in any way by the current API, or would like to see additional models, schedulers, or techniques, please open a GitHub issue mentioning what you would like to see.

Credits

This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today:

  • @CompVis' latent diffusion models library, available here
  • @hojonathanho original DDPM implementation, available here as well as the extremely useful translation into PyTorch by @pesser, available here
  • @ermongroup's DDIM implementation, available here.
  • @yang-song's Score-VE and Score-VP implementations, available here

We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available here as well as @crowsonkb and @rromb for useful discussions and insights.

Citation

@misc{von-platen-etal-2022-diffusers,
  author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Thomas Wolf},
  title = {Diffusers: State-of-the-art diffusion models},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/diffusers}}
}