Dreambooth on Stable Diffusion
This is an implementation of Google's Dreambooth with Stable Diffusion. The original Dreambooth is based on the Imagen text-to-image model. However, neither the model nor the pre-trained weights of Imagen are available. To enable people to fine-tune a text-to-image model with a few examples, I implemented the idea of Dreambooth on Stable Diffusion.
This code repository is based on that of Textual Inversion. Note that Textual Inversion only optimizes word embeddings, while Dreambooth fine-tunes the whole diffusion model.
The implementation makes minimal changes to the official codebase of Textual Inversion. In fact, out of laziness, some components of Textual Inversion, such as the embedding manager, are kept even though they are never used here.
Usage
Preparation
To fine-tune a Stable Diffusion model, you need to obtain the pre-trained Stable Diffusion weights following their instructions. Weights can be downloaded from Hugging Face. You can decide which version of the checkpoint to use, but I use sd-v1-4-full-ema.ckpt.
We also need to create a set of images for regularization, since the fine-tuning algorithm of Dreambooth requires them. Details of the algorithm can be found in the paper. The text prompt can be photo of a xxx, where xxx is a word that describes the class of your object, such as dog. The command is
python scripts/stable_txt2img.py --ddim_eta 0.0 --n_samples 8 --n_iter 1 --scale 10.0 --ddim_steps 50 --ckpt /path/to/original/stable-diffusion/sd-v1-4-full-ema.ckpt --prompt "a photo of a <xxx>"
I generate 8 images for regularization. After that, save the generated images (separately, one image per .png file) at /root/to/regularization/images.
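If it helps, the snippet below is one way to collect the generated samples into that folder. It is only a sketch: the source directory outputs/txt2img-samples/samples is an assumption about where stable_txt2img.py writes its individual images, so point it at wherever your samples actually land.
# Collect the generated samples into the regularization folder, one .png per file.
# NOTE: the source path is an assumption about stable_txt2img.py's output location.
mkdir -p /root/to/regularization/images
cp outputs/txt2img-samples/samples/*.png /root/to/regularization/images/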
Training
Training can be done by running the following command
python main.py --base configs/stable-diffusion/v1-finetune_unfrozen.yaml \
    -t \
    --actual_resume /path/to/original/stable-diffusion/sd-v1-4-full-ema.ckpt \
    -n <job name> \
    --gpus 0, \
    --data_root /root/to/training/images \
    --reg_data_root /root/to/regularization/images \
    --class_word <xxx>
Detailed configuration can be found in configs/stable-diffusion/v1-finetune_unfrozen.yaml. In particular, the default learning rate is 1.0e-6, as I found that the 1.0e-5 used in the Dreambooth paper leads to poor editability. The parameter reg_weight corresponds to the weight of the regularization term in the Dreambooth paper, and its default is set to 1.0.
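Before launching training, you can sanity-check these defaults straight from the config file. This is only a sketch: the exact key names are assumptions about the config layout, so broaden the pattern if nothing matches.
# Show the hyperparameters discussed above; key names are assumed, not verified.
grep -nE "learning_rate|reg_weight" configs/stable-diffusion/v1-finetune_unfrozen.yaml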
Dreambooth requires a placeholder word [V], called the identifier, as in the paper. This identifier needs to be a relatively rare token in the vocabulary. The original paper approaches this by using a rare word from the T5-XXL tokenizer. For simplicity, here I just use a random word, sks, and hard-coded it. If you want to change that, simply make a change in this file.
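If you do change the identifier, one quick way to find every place the string is hard-coded is a repository-wide search. The paths below come from the repo layout; searching all of them is just a safe default.
# Locate the hard-coded identifier "sks" before replacing it with your own word.
grep -rn --include="*.py" "sks" ldm/ scripts/ main.py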
Training runs for 800 steps, and two checkpoints will be saved at ./logs/<job_name>/checkpoints, one at 500 steps and one at the final step. Typically the one at 500 steps works well enough. I train the model using two A6000 GPUs, and it takes ~15 mins.
Generation
After training, personalized samples can be obtained by running the command
python scripts/stable_txt2img.py --ddim_eta 0.0 \
    --n_samples 8 \
    --n_iter 1 \
    --scale 10.0 \
    --ddim_steps 100 \
    --ckpt /path/to/saved/checkpoint/from/training \
    --prompt "photo of a sks <class>"
In particular, sks is the identifier; replace it with your own choice if you changed the identifier. <class> is the class word you passed as --class_word during training. For example, with --class_word dog, the prompt would be "photo of a sks dog".
Results
Here I show some qualitative results. The training images are taken from an issue in the Textual Inversion repository; they are 3 images of a large trash container. Regularization images are generated with the prompt photo of a container.