## Textual Inversion fine-tuning example

[Textual inversion](https://arxiv.org/abs/2208.01618) is a method to personalize text-to-image models like Stable Diffusion on your own images using just 3-5 examples.

The `textual_inversion.py` script shows how to implement the training procedure and adapt it for Stable Diffusion.
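
Conceptually, the script adds the placeholder token to the tokenizer as a new token, initializes its embedding from the initializer token, and then optimizes only that embedding row while the rest of the model stays frozen. Below is a minimal sketch of that setup step (not the full training loop), assuming the CLIP checkpoint that Stable Diffusion `v1-4` uses as its text encoder and the token choices from the example further down:

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

placeholder_token = "<cat-toy>"  # the new concept token
initializer_token = "toy"        # a rough semantic starting point

# Add the placeholder to the vocabulary and make room for its embedding.
num_added = tokenizer.add_tokens(placeholder_token)
assert num_added == 1, "the placeholder token must not already exist"
text_encoder.resize_token_embeddings(len(tokenizer))

placeholder_id = tokenizer.convert_tokens_to_ids(placeholder_token)
initializer_id = tokenizer.encode(initializer_token, add_special_tokens=False)[0]

# Initialize the new embedding from the initializer token's embedding;
# training then updates only this single row.
embeddings = text_encoder.get_input_embeddings().weight.data
embeddings[placeholder_id] = embeddings[initializer_id]
```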
### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:
```bash
pip install diffusers[training] accelerate transformers
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
```bash
accelerate config
```
### Cat toy example
You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree.

You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

Run the following command to authenticate with your token:
```bash
huggingface-cli login
```
If you have already cloned the repo, then you won't need to go through these steps. You can simply remove the `--use_auth_token` arg from the following command.

Now let's get our dataset. Download 3-4 images from [here](https://drive.google.com/drive/folders/1fmJMs25nxS_rSNqS5hTcRdLem_YQXbq5) and save them in a directory. This will be our training data.
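
If you want to sanity-check the data before starting a run, a small hypothetical helper like the one below works; the directory path is the same placeholder used for `DATA_DIR` in the command that follows:

```python
from pathlib import Path
from PIL import Image

# Quick check: confirm every file in the training directory opens as an image.
# The training script resizes them to --resolution, so sizes need not match.
data_dir = Path("path-to-dir-containing-images")
for path in sorted(data_dir.iterdir()):
    image = Image.open(path).convert("RGB")
    print(path.name, image.size)
```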
And launch the training using:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="path-to-dir-containing-images"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat"
```
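
Note that with `--scale_lr` the script scales the base learning rate by the gradient accumulation steps, the batch size, and the number of processes, so the value actually used for optimization is higher than `--learning_rate`. A small sketch of that arithmetic for the settings above, assuming a single-GPU run:

```python
# Effective learning rate when --scale_lr is set (single-GPU assumption).
base_lr = 5.0e-04
gradient_accumulation_steps = 2
train_batch_size = 1
num_processes = 1  # one V100 in this example

effective_lr = base_lr * gradient_accumulation_steps * train_batch_size * num_processes
print(effective_lr)  # 0.001
```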
A full training run takes ~1 hour on one V100 GPU.
### Inference
Once you have trained a model using the above command, inference can be done simply using the `StableDiffusionPipeline`. Make sure to include the `placeholder_token` in your prompt.
```python
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# The prompt must contain the trained placeholder token.
prompt = "A <cat-toy> backpack"

with autocast("cuda"):
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5)["sample"][0]

image.save("cat-backpack.png")
```
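
The script also writes the learned embedding to a file in the output directory. As an alternative to loading the full saved pipeline, here is a sketch of loading just that embedding into a fresh `v1-4` pipeline; the `learned_embeds.bin` file name and its `{placeholder_token: tensor}` layout are assumptions about the script's output format:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", use_auth_token=True
).to("cuda")

# Assumed output of textual_inversion.py: a dict mapping the placeholder
# token to its learned embedding tensor.
learned_embeds = torch.load("textual_inversion_cat/learned_embeds.bin")
placeholder_token, embedding = next(iter(learned_embeds.items()))

# Register the token and copy its trained embedding into the text encoder.
pipe.tokenizer.add_tokens(placeholder_token)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids(placeholder_token)
pipe.text_encoder.get_input_embeddings().weight.data[token_id] = embedding
```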