# DreamBooth training example
[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject.

The `train_dreambooth.py` script shows how to implement the training procedure and adapt it for Stable Diffusion.
## Running locally
### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:

```bash
pip install git+https://github.com/huggingface/diffusers.git
pip install -U -r requirements.txt
```

And initialize an [🤗 Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```
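If you would rather not answer the interactive questionnaire (for example on a headless machine or in CI), recent Accelerate releases also let you write a default configuration programmatically; `write_basic_config` and its `mixed_precision` argument are assumptions about your installed version:

```bash
python -c "from accelerate.utils import write_basic_config; write_basic_config(mixed_precision='fp16')"
```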
### Dog toy example
### Dog toy example

You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license, and tick the checkbox if you agree.

You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

Run the following command to authenticate your token:

```bash
huggingface-cli login
```
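If an interactive prompt is inconvenient (for example in a batch job), the token can usually also be supplied through an environment variable instead; the variable name below is what older `huggingface_hub` releases expect and may differ in your version:

```bash
export HUGGING_FACE_HUB_TOKEN="hf_..."  # placeholder value, use your own token
```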
If you have already cloned the repo, then you won't need to go through these steps.

Now let's get our dataset. Download images from [here](https://drive.google.com/drive/folders/1BO_dyz-p65qhBRRMRA4TbZ8qW4rB99JZ) and save them in a directory. This will be our training data.
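The script simply reads every image file it finds in the instance directory, so a flat folder with no captions or metadata is all that's needed; the filenames below are purely illustrative:

```bash
ls path-to-instance-images
# dog_01.jpg  dog_02.jpg  dog_03.jpg  dog_04.jpg  dog_05.jpg
```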
Then launch the training with:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400
```
### Training with prior-preservation loss

Prior preservation is used to avoid overfitting and language drift (refer to the paper to learn more about it). For prior preservation, we first generate images using the model with a class prompt and then use those images during training along with our data.

According to the paper, it's recommended to generate `num_epochs * num_samples` images for prior preservation. 200-300 images work well for most cases.
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800
```
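The script will sample any missing class images for you before training starts, but you can also pre-fill `CLASS_DIR` yourself. A minimal sketch, assuming the same base model and the `--class_prompt`/`--num_class_images` values used above:

```python
from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline

class_dir = Path("path-to-class-images")
class_dir.mkdir(parents=True, exist_ok=True)

# load the same base model that will be fine-tuned
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of dog"   # matches --class_prompt
num_class_images = 200      # matches --num_class_images

for i in range(num_class_images):
    image = pipe(prompt).images[0]
    image.save(class_dir / f"{i:04d}.jpg")
```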
### Training on a 16GB GPU

With the help of gradient checkpointing and the 8-bit optimizer from bitsandbytes, it's possible to train DreamBooth on a 16GB GPU.
Install `bitsandbytes` with `pip install bitsandbytes`, then run:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800
```
### Training on an 8 GB GPU

By using [DeepSpeed](https://www.deepspeed.ai/) it's possible to offload some tensors from VRAM to either CPU or NVMe, allowing training with less VRAM.

DeepSpeed needs to be enabled with `accelerate config`. During configuration, answer yes to "Do you want to use DeepSpeed?". With DeepSpeed stage 2, fp16 mixed precision, and offloading both parameters and the optimizer state to CPU, it's possible to train on under 8 GB of VRAM, with the drawback of requiring significantly more system RAM (about 25 GB). See the [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more DeepSpeed configuration options.

Changing the default Adam optimizer to DeepSpeed's special version of Adam, `deepspeed.ops.adam.DeepSpeedCPUAdam`, gives a substantial speedup, but enabling it requires a CUDA toolchain with the same version as PyTorch. The 8-bit optimizer does not seem to be compatible with DeepSpeed at the moment.
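For reference, the relevant part of an Accelerate config produced this way might look roughly like the snippet below. The exact keys and defaults vary across Accelerate versions, so treat this only as an illustration, not as a file to copy verbatim:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero_stage: 2
mixed_precision: fp16
num_processes: 1
```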
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800 \
  --mixed_precision=fp16
```
### Fine-tuning the text encoder with the UNet

The script also lets you fine-tune the `text_encoder` along with the `unet`. It has been observed experimentally that fine-tuning the `text_encoder` gives much better results, especially on faces.
Pass the `--train_text_encoder` argument to the script to enable training the `text_encoder`.

___Note: Training the text encoder requires more memory; with this option, training won't fit on a 16GB GPU. It needs at least 24GB of VRAM.___
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="path-to-instance-images"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_text_encoder \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --use_8bit_adam \
  --gradient_checkpointing \
  --learning_rate=2e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800
```
## Inference

Once you have trained a model using one of the commands above, inference can be done simply with the `StableDiffusionPipeline`. Make sure to include the identifier (e.g. `sks` in the above examples) in your prompt.
```python
import torch
from diffusers import StableDiffusionPipeline

model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A photo of sks dog in a bucket"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

image.save("dog-bucket.png")
```
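If you want reproducible samples, the pipeline call in the snippet above also accepts a seeded `torch.Generator`:

```python
# optional: seed the sampler so the same prompt yields the same image
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5, generator=generator).images[0]
image.save("dog-bucket-seed0.png")
```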