# Textual Inversion fine-tuning example
Textual inversion is a method to personalize text-to-image models like Stable Diffusion on your own images, using just 3-5 examples. The `textual_inversion.py` script shows how to implement the training procedure and adapt it for Stable Diffusion.
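Under the hood, the only thing being trained is the embedding of a new placeholder token; the rest of the model stays frozen. Below is a minimal sketch of that setup, not the actual training script; the model name and the `<cat-toy>`/`toy` tokens mirror the cat toy example further down:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="tokenizer", use_auth_token=True
)
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_auth_token=True
)

# add the placeholder token and initialize it from a related word ("toy")
tokenizer.add_tokens("<cat-toy>")
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids("<cat-toy>")
init_id = tokenizer.convert_tokens_to_ids("toy")
embeds = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    embeds[token_id] = embeds[init_id].clone()

# freeze everything except the token embedding table, so that training
# only moves the new token's vector
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().weight.requires_grad_(True)
```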
## Running on Colab
## Running locally
### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:

```bash
pip install diffusers[training] accelerate transformers
```
And initialize an 🤗 Accelerate environment with:

```bash
accelerate config
```
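If you'd rather skip the interactive prompts (for example, in a notebook), recent versions of 🤗 Accelerate can also write a default config programmatically:

```python
from accelerate.utils import write_basic_config

# writes a default config file, e.g. ~/.cache/huggingface/accelerate/default_config.yaml
write_basic_config()
```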
### Cat toy example
You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit its card, read the license, and tick the checkbox if you agree.
You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need an access token for the code to work. For more information on access tokens, please refer to this section of the documentation.
Run the following command to authenticate your token:

```bash
huggingface-cli login
```
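If you are in a notebook (such as Colab) where the CLI prompt is awkward, the same login can be done from Python via `huggingface_hub`:

```python
from huggingface_hub import notebook_login

notebook_login()  # opens a prompt to paste your access token
```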
If you have already cloned the repo, then you won't need to go through these steps. You can simply remove the `--use_auth_token` arg from the following command.
Now let's get our dataset. Download 3-4 images from here and save them in a directory. This will be our training data.
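Before launching the run, it can be worth a quick sanity check that every file in the directory actually opens as an image. A hypothetical helper, not part of the example scripts:

```python
from pathlib import Path
from PIL import Image

data_dir = Path("path-to-dir-containing-images")  # same directory as $DATA_DIR below
for path in sorted(data_dir.iterdir()):
    with Image.open(path) as img:  # raises if the file is not a readable image
        print(path.name, img.size, img.mode)
```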
And launch the training using:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="path-to-dir-containing-images"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat"
```
A full training run takes ~1 hour on one V100 GPU.
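Note that with `--scale_lr` the script (at the time of writing) multiplies the base learning rate by the gradient accumulation steps, the batch size, and the number of GPUs. With the settings above on a single GPU, the effective learning rate is 5.0e-04 × 4 × 1 × 1 = 2.0e-03.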
## Inference
Once you have trained a model using the above command, inference can be done simply using the `StableDiffusionPipeline`. Make sure to include the `placeholder_token` in your prompt.
```python
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A <cat-toy> backpack"

with autocast("cuda"):
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

image.save("cat-backpack.png")
```
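The script also saves just the learned embedding in the output directory (as `learned_embeds.bin` at the time of writing). A minimal sketch, assuming that filename and the output directory from the command above, of loading it into a fresh copy of the base model instead of the full trained pipeline:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", use_auth_token=True
).to("cuda")

# the file maps the placeholder token to its trained embedding vector
learned = torch.load("textual_inversion_cat/learned_embeds.bin")
token, embedding = next(iter(learned.items()))

# register the token and copy its embedding into the text encoder
pipe.tokenizer.add_tokens(token)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids(token)
with torch.no_grad():
    pipe.text_encoder.get_input_embeddings().weight[token_id] = embedding.to(
        dtype=pipe.text_encoder.dtype, device=pipe.device
    )
```

Prompting with the placeholder token then works exactly as in the example above.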