## Every Dream v2 RunPod Setup

[General Instructions](https://github.com/victorchall/EveryDream2trainer/blob/main/README.md)

You can sign up for RunPod here (shameless referral link): [Runpod](https://runpod.io/?ref=oko38cd0)

If you are confused by the wall of text, join the discord here: [EveryDream Discord](https://discord.gg/uheqxU6sXN)

### Usage

1. Prepare your training data before you begin (see below)
2. Spin up the `RunPod Stable Diffusion v2.1` template. The `RunPod PyTorch` template does not work because its Python version is too old.
3. Open this notebook with `File > Open from URL...` pointing to `https://raw.githubusercontent.com/victorchall/EveryDream2trainer/main/Train_Runpod.ipynb`
4. Run each cell below once, noting any instructions above the cell (the first step requires a pod restart)
5. Figure out how you want to tweak the process next
6. Rinse, Repeat

#### A note on storage
Remember, on RunPod time is more expensive than storage. 

Which is good, because running a lot of experiments can generate a lot of data. Not having the right save points to recover quickly from inevitable mistakes will cost you a lot of time.

When in doubt, give yourself ~125GB of Runpod **Volume** storage.
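
If you want to confirm how much room is left before a long run, a minimal sketch like the one below reports free space. It assumes the volume is mounted at `/workspace`, as the paths in this notebook suggest; `free_gb` is a hypothetical helper name, not part of EveryDream2trainer.

```python
import os
import shutil


def free_gb(path):
    """Return the free space at `path` in gigabytes (1 GB = 1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


# Assumption: the RunPod volume is mounted at /workspace.
# Fall back to the current directory when running elsewhere.
workspace = "/workspace" if os.path.isdir("/workspace") else "."
print(f"{free_gb(workspace):.1f} GB free at {workspace}")
```

Checkpoints saved at full precision are large, so it is worth running this again between experiments.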

#### Preparing your training data
You will want to have your data prepared before starting, and have a rough training plan in mind. Don't waste rental fees if you're not fully prepared to start training.  

**If this is your first time trying a full fine-tune, start small!** 
Pick a single concept and 30-100 images, and see what happens. Training a small dataset like this is fast, and will give you a feel for how quickly your model (over-)trains depending on your training schedule.

Your files should be captioned before you start with either the caption as the filename or in text files for each image alongside the image files.  See [DATA.md](https://github.com/victorchall/EveryDream2trainer/blob/main/doc/DATA.md) for more details. Tools are available to automatically caption your files.
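
As a quick sanity check before uploading, the sketch below walks a folder and reports, for each image, whether a text file with the same name sits next to it or whether the filename itself would have to serve as the caption. The list of image extensions and the helper name `caption_sources` are assumptions for illustration; DATA.md remains the authority on what the trainer actually accepts.

```python
from pathlib import Path

# Assumption: a typical set of image extensions, not the trainer's official list.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def caption_sources(data_root):
    """Map each image path to 'txt' (sidecar caption file) or 'filename'."""
    result = {}
    root = Path(data_root)
    if not root.is_dir():
        return result
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            sidecar = path.with_suffix(".txt")
            result[str(path)] = "txt" if sidecar.exists() else "filename"
    return result


# Only prints something if a local "input" folder exists.
for image, source in caption_sources("input").items():
    print(f"{source:8} {image}")
```

Images listed as `filename` with names like `IMG_0001.jpg` are the ones that still need a real caption.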

# For best results, restart the pod after the next cell completes

Here we ensure that EveryDream2trainer is installed, and we disable the Automatic 1111 web-ui. However, the VRAM consumed by the web-ui will not be fully freed until the pod restarts. This is especially important if you are training with large batch sizes.

In [None]:
import os

%cd /workspace
!echo pass > /workspace/stable-diffusion-webui/relauncher.py
if not os.path.exists("EveryDream2trainer"):
    !git clone https://github.com/victorchall/EveryDream2trainer

%cd EveryDream2trainer
%mkdir input
!python utils/get_yamls.py

In [None]:
# When running on a pod designed for Automatic 1111 
# we need to kill the webui process to free up mem for training
!ps x | grep -E "(relauncher|webui)" | grep -v grep | awk '{print $1}' | xargs -r kill

# check system resources, make sure your GPU actually has 24GB
# You should see something like "0 MB / 24576 MB" in the middle of the printout
# if you see 0 MB / 22000 MB pick a beefier instance...
!grep Swap /proc/meminfo
!swapon -s
!nvidia-smi

# Upload training files

Use the navigation on the left to open **"workspace / EveryDream2trainer / input"** and upload your training files using the **up arrow button** above the file explorer, or by dragging and dropping the files from your local machine onto the file explorer.

If you have many training files, or nested folders of training data, create a zip archive of your training data, upload this file to the input folder, then right click on the zip file and select "Extract Archive".
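
If the "Extract Archive" menu entry is not available, the archive can also be unpacked from a code cell. This is a minimal sketch using Python's standard `zipfile` module; `extract_archive` and the archive name in the commented usage line are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of `zip_path` into `dest_dir` and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Hypothetical usage, assuming you uploaded input/my_dataset.zip:
# extract_archive("input/my_dataset.zip", "input")
```

Nested folders inside the archive are recreated under the destination folder.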

## Optional - Configure sample prompts
You can set your own sample prompts by adding them, one line at a time, to sample_prompts.txt.

Keep in mind that a longer list of prompts will take longer to generate. You may also want to raise `sample_steps` in the training cell to get samples less often. This is probably a good idea when training a larger dataset that you know will take longer to train, where more frequent samples will not help you.
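
If you would rather set the prompts from a cell than edit the file by hand, something like the following would do it. It is only a sketch: `write_sample_prompts` is a hypothetical helper, the example prompts are placeholders, and it assumes `sample_prompts.txt` lives in the trainer's working directory.

```python
from pathlib import Path


def write_sample_prompts(prompts, path="sample_prompts.txt"):
    """Write one prompt per line, skipping blank entries; return the number written."""
    lines = [p.strip() for p in prompts if p.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Hypothetical usage (this overwrites the existing file):
# write_sample_prompts(["a photo of my concept", "my concept in a forest"])
```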

While your training data is uploading, go ahead and install the dependencies below
----

## Install dependencies

**This will take up to 15 minutes (if building xformers).  Wait until it says "DONE" to move on.** 
You can ignore "warnings."

In [None]:
!python -m pip install --upgrade pip

!pip install requests==2.25.1
!pip install -U -I torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url "https://download.pytorch.org/whl/cu117"
!pip install transformers==4.25.1
!pip install -U diffusers[torch]

!pip install pynvml==11.4.1
!pip install bitsandbytes==0.35.0
!pip install ftfy==6.1.1
!pip install aiohttp==3.8.3
!pip install "tensorboard>=2.11.0"
!pip install protobuf==3.20.2
!pip install wandb==0.13.6
!pip install colorama==0.4.6
!pip install -U triton
!pip install --pre -U xformers
    
print("DONE")

## Now that dependencies are installed, ready to move on!

## Log into huggingface
Run the cell below and paste your token into the prompt.  You can get your token from your [huggingface account page](https://huggingface.co/settings/tokens).

The token will not show on the screen, just press enter after you paste it.

Then run the following cell to download the base checkpoint (may take a minute).

In [None]:
from huggingface_hub import notebook_login, hf_hub_download
import os
notebook_login()

In [None]:
%cd /workspace/EveryDream2trainer
repo="panopstor/EveryDream"
ckpt_file="sd_v1-5_vae.ckpt"

print(f"Downloading {ckpt_file} from {repo}")
downloaded_model_path = hf_hub_download(repo, ckpt_file, cache_dir="/workspace/hfcache")
ckpt_name = os.path.splitext(os.path.basename(downloaded_model_path))[0]
print(f"Downloaded {ckpt_name} to {downloaded_model_path}")

if not os.path.exists(f"ckpt_cache/{ckpt_name}"):
    print(f"Converting {ckpt_name} to Diffusers format")
    !python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim \
    --original_config_file v1-inference.yaml \
    --image_size 512 \
    --checkpoint_path "{downloaded_model_path}" \
    --prediction_type epsilon \
    --upcast_attn False \
    --dump_path "ckpt_cache/{ckpt_name}"


print("DONE")

# Start Training
Naming your project will help you track what the heck you're doing when you're floating in checkpoint files later.

You may wish to consider adding "sd1" or "sd2v" or similar to remember what the base was, as you'll also have to tell your inference app what you were using, since it's difficult for programs to know automatically which inference YAML to use. For instance, Automatic1111 webui requires you to copy the v2 inference YAML and rename it to match your checkpoint name so it knows how to load the file; otherwise it assumes the checkpoint is SD 1.x compatible. Something to keep in mind if you start training on SD2.1.

`max_epochs`, `sample_steps`, and `save_every_n_epochs` should be tuned to your dataset. I like to generate one or two sets of samples per save, and aim for 5 (give or take 2) saved checkpoints.
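
To see how these three settings interact, a rough back-of-the-envelope calculation helps. The sketch below assumes one optimizer step per batch, every image seen once per epoch, and a partial last batch counted as a step; the trainer's real step count may differ (for example because of aspect-ratio bucketing), so treat the numbers as estimates. `plan_schedule` is a hypothetical helper and the 100-image dataset is a made-up example.

```python
import math


def plan_schedule(num_images, batch_size, max_epochs, save_every_n_epochs, sample_steps):
    """Estimate step counts, number of saves, and sample sets generated per save."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    total_steps = steps_per_epoch * max_epochs
    num_saves = max_epochs // save_every_n_epochs
    steps_per_save = steps_per_epoch * save_every_n_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "num_saves": num_saves,
        "sample_sets_per_save": steps_per_save / sample_steps,
    }


# Made-up example: 100 images with batch size 8, 100 epochs,
# a save every 50 epochs, and samples every 250 steps.
print(plan_schedule(100, 8, 100, 50, 250))
```

Under those assumptions this works out to 13 steps per epoch, 1,300 steps in total, 2 saves, and about 2.6 sample sets per save. Lowering `save_every_n_epochs` to 20 would give 5 saves, at which point `sample_steps` would need to drop as well to keep at least one sample set per save.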

Next cell runs training. This will take a while depending on your number of images, repeats, and max_epochs.

You can watch for test images in the logs folder.
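
If browsing the file explorer gets tedious, the newest sample images can also be listed from a second notebook or terminal. This sketch makes no assumption about how the trainer lays out the `logs` folder; it simply searches it recursively for image files and sorts them by modification time. `newest_images` is a hypothetical helper.

```python
from pathlib import Path


def newest_images(log_root="logs", limit=5, exts=(".jpg", ".jpeg", ".png", ".webp")):
    """Return up to `limit` image paths under `log_root`, most recently modified first."""
    root = Path(log_root)
    if not root.is_dir():
        return []
    images = [p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in exts]
    images.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return images[:limit]


# Prints nothing if there is no logs folder or no images yet.
for path in newest_images():
    print(path)
```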

In [None]:
%cd /workspace/EveryDream2trainer
!python train.py --project_name "sd1_mymodel_000" \
--resume_ckpt "sd_v1-5_vae" \
--data_root "input" \
--resolution 512 \
--batch_size 8 \
--max_epochs 100 \
--save_every_n_epochs 50 \
--lr 1.8e-6 \
--lr_scheduler cosine \
--sample_steps 250 \
--useadam8bit \
--save_full_precision \
--shuffle_tags \
--amp \
--write_schedule

!python train.py --project_name "sd1_mymodel_100" \
--resume_ckpt "findlast" \
--data_root "input" \
--resolution 512 \
--batch_size 4 \
--max_epochs 100 \
--save_every_n_epochs 20 \
--lr 1.0e-6 \
--lr_scheduler constant \
--sample_steps 150 \
--useadam8bit \
--save_full_precision \
--shuffle_tags \
--amp \
--write_schedule

# HuggingFace upload
Use the cell below to upload one or more checkpoints to your personal HuggingFace account, if you want, instead of manually downloading. You should already be authenticated with HuggingFace by token if you used the download/token cells above.

* You can get your account name from your [HuggingFace account page](https://huggingface.co/settings/account). Look for your "username" field and paste it below.

* You only need to set up a repository one time.  You can create it here: [Create New HF Model](https://huggingface.co/new)  Make sure you write down the repo name you make for future use.  You can reuse it later.

In [None]:
import glob
import os
from huggingface_hub import HfApi
from ipywidgets import *

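# List the checkpoint files in the current directory
all_ckpts = glob.glob("*.ckpt")

# Define the layout here so this cell does not depend on the inference cell further down
full_width = Layout(width='600px')

ckpt_picker = SelectMultiple(options=all_ckpts, layout=full_width)
hfuser = Text(placeholder='Your HF user name')
hfrepo = Text(placeholder='Your HF repo name')

api = HfApi()
upload_btn = Button(description='Upload', layout=full_width)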
out = Output()

def upload_ckpts(_):
    repo_id=f"{hfuser.value}/{hfrepo.value}"
    with out:
        for ckpt in ckpt_picker.value:
            print(f"Uploading to HF: huggingface.co/{repo_id}/{ckpt}")
            response = api.upload_file(
                path_or_fileobj=ckpt,
                path_in_repo=ckpt,
                repo_id=repo_id,
                repo_type=None,
                create_pr=True,
            )
            display(response)
        print("DONE")
        print("Go to your repo and accept the PRs this created to see your files")

upload_btn.on_click(upload_ckpts)
box = VBox([ckpt_picker, HBox([hfuser, hfrepo]), upload_btn, out])

display(box)

# Test inference on your checkpoints

In [None]:
%cd /workspace/EveryDream2trainer
from ipywidgets import *
from IPython.display import display, clear_output
import os
import gc
import random
import torch
import inspect

from torch import autocast
from diffusers import StableDiffusionPipeline, AutoencoderKL, UNet2DConditionModel, DDIMScheduler, DDPMScheduler, PNDMScheduler, EulerAncestralDiscreteScheduler
from transformers import CLIPTextModel, CLIPTokenizer


checkpoints_ts = []
# A folder containing model_index.json is a Diffusers-format checkpoint
for root, dirs, files in os.walk("."):
    for file in files:
        if file == "model_index.json":
            ts = os.path.getmtime(os.path.join(root, file))
            checkpoints_ts.append((ts, root))

checkpoints = [ckpt for (_, ckpt) in sorted(checkpoints_ts, reverse=True)]
full_width = Layout(width='600px')
half_width = Layout(width='300px')

checkpoint = Dropdown(options=checkpoints, description='Checkpoint:', layout=full_width)
prompt = Textarea(value='a photo of ', description='Prompt:', layout=full_width)
height = IntSlider(value=512, min=256, max=768, step=32, description='Height:', layout=half_width)
width = IntSlider(value=512, min=256, max=768, step=32, description='Width:', layout=half_width)
cfg = FloatSlider(value=7.0, min=0.0, max=14.0, step=0.2, description='CFG Scale:', layout=half_width)
steps = IntSlider(value=30, min=10, max=100, description='Steps:', layout=half_width)
seed = IntText(value=-1, description='Seed:', layout=half_width)
generate_btn = Button(description='Generate', layout=full_width)
out = Output()

def generate(_):
    with out:
        clear_output()
        display(f"Loading model {checkpoint.value}")
        actual_seed = seed.value if seed.value != -1 else random.randint(0, 2**30)

        text_encoder = CLIPTextModel.from_pretrained(checkpoint.value, subfolder="text_encoder")
        vae = AutoencoderKL.from_pretrained(checkpoint.value, subfolder="vae")
        unet = UNet2DConditionModel.from_pretrained(checkpoint.value, subfolder="unet")
        tokenizer = CLIPTokenizer.from_pretrained(checkpoint.value, subfolder="tokenizer", use_fast=False)
        scheduler = DDIMScheduler.from_pretrained(checkpoint.value, subfolder="scheduler")
        text_encoder.eval()
        vae.eval()
        unet.eval()

        text_encoder.to("cuda")
        vae.to("cuda")
        unet.to("cuda")

        pipe = StableDiffusionPipeline(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            safety_checker=None, # save vram
            requires_safety_checker=False, # avoid nag
            feature_extractor=None, # must be None if there is no safety checker
        )

        pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
        
        print(inspect.cleandoc(f"""
              Prompt: {prompt.value}
              Resolution: {width.value}x{height.value}
              CFG: {cfg.value}
              Steps: {steps.value}
              Seed: {actual_seed}
              """))
        with autocast("cuda"):
            image = pipe(prompt.value, 
                generator=torch.Generator("cuda").manual_seed(actual_seed),
                num_inference_steps=steps.value, 
                guidance_scale=cfg.value,
                width=width.value,
                height=height.value
            ).images[0]
        del pipe
        gc.collect()
        with torch.cuda.device("cuda"):
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
        display(image)
            
generate_btn.on_click(generate)
box = VBox(
    children=[
        checkpoint, prompt, 
        HBox([VBox([width, height]), VBox([steps, cfg])]), 
        seed, 
        generate_btn, 
        out]
)


display(box)