runpod maybe working ish

This commit is contained in:
Victor Hall 2022-11-08 23:00:54 -05:00
parent 383df44a7b
commit 06a3e48237
13 changed files with 421 additions and 8 deletions

326
Train-Runpod.ipynb Normal file
View File

@@ -0,0 +1,326 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "676114ae",
"metadata": {},
"source": [
"## Every Dream trainer\n",
"\n",
"You will need your data prepared first before starting! Don't waste rental fees if you're not ready to upload your files. Your files should be captioned before you start with either the caption as the filename or in text files for each image alongside the image files. See main README.md for more details. Tools are available to automatically caption your files.\n",
"\n",
"[Instructions](https://github.com/victorchall/EveryDream-trainer/blob/main/README.md)\n",
"\n",
"If you can sign up for Runpod here (shameless referral link): [Runpod](https://runpod.io?ref=oko38cd0)\n",
"\n",
"If you are confused by the wall of text, join the discord here: [EveryDream Discord](https://discord.gg/uheqxU6sXN)\n",
"\n",
"Make sure you have at least 40GB of Runpod **Volume** storage at a minimum so you don't waste training just 1 ckpt that is overtrained and have to start over. Penny pinching on storage is ultimately a waste of your time and money! This is setup to give you more than one ckpt so you don't overtrain.\n",
"\n",
"### Starting model\n",
"Make sure you have your hugging face token ready to download the 1.5 mode. You can get one here: https://huggingface.co/settings/tokens\n",
"If you don't have a User Access Token, create one. Or you can upload a starting checkpoint instead of using the HF download and skip that step, but you'll need to modify the starting model name when you start training (more info below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb6d14b7-3c37-4ec4-8559-16b4e9b8dd18",
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/victorchall/everydream-trainer\n",
"%cd everydream-trainer"
]
},
{
"cell_type": "markdown",
"id": "589bfca0",
"metadata": {},
"source": [
"## Install dependencies\n",
"You can ignore \"warnings.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab559338",
"metadata": {},
"outputs": [],
"source": [
"# BUILD ENV\n",
"!pip install -q omegaconf\n",
"!pip install -q einops\n",
"!pip install -q pytorch-lightning==1.6.5\n",
"!pip install -q test-tube\n",
"!pip install -q transformers==4.19.2\n",
"!pip install -q kornia\n",
"!pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers\n",
"!pip install -e git+https://github.com/openai/CLIP.git@main#egg=clip\n",
"!pip install -q setuptools==59.5.0\n",
"!pip install -q pillow==9.0.1\n",
"!pip install -q torchmetrics==0.6.0\n",
"!pip install -e .\n",
"#!pip install -qq diffusers[\"training\"]==0.3.0 transformers ftfy\n",
"!pip install -qq ipywidgets==8.0.2\n",
"!pip install huggingface_hub\n",
"#!pip install ipywidgets==7.7.1\n",
"import ipywidgets as widgets"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55716da3-7229-45e0-b8c1-2b25466fd126",
"metadata": {},
"outputs": [],
"source": [
"!pip install omegaconf\n",
"!pip install albumentations==1.1.0\n",
"!pip install transformers==4.19.2\n",
"!pip install torchvision==0.13.1\n",
"!pip install pudb==2019.2\n",
"!pip install imageio==2.14.1\n",
"!pip install imageio-ffmpeg==0.4.7\n",
"!pip install test-tube>=0.7.5\n",
"!pip install einops==0.4.1\n",
"!pip install pillow==9.0.1\n",
"!pip install torch-fidelity==0.3.0\n",
"!pip install torchmetrics==0.6.0\n",
"!pip install kornia==0.6\n",
"!pip install huggingface_hub\n",
"!pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers\n",
"!pip install -e git+https://github.com/openai/CLIP.git@main#egg=clip\n",
"!pip install -e ."
]
},
{
"cell_type": "markdown",
"id": "c230d91a",
"metadata": {},
"source": [
"## Now that dependencies are installed, ready to move on!"
]
},
{
"cell_type": "markdown",
"id": "17affc47",
"metadata": {},
"source": [
"## Log into huggingface\n",
"Run the cell below and paste your token into the prompt. You can get your token from your huggingface account page.\n",
"\n",
"The token will not show on the screen, just press enter after you paste it.\n",
"\n",
"Then run the following cell to download the base checkpoint (may take a minute)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "02c8583e",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "39a1acc3a2914d9797fc2f8ff11a9a69",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from huggingface_hub import notebook_login\n",
"\n",
"notebook_login()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "503322f5",
"metadata": {},
"outputs": [],
"source": [
"from huggingface_hub import hf_hub_download\n",
"downloaded_model_path = hf_hub_download(\n",
" repo_id=\"runwayml/stable-diffusion-v1-5\",\n",
" filename=\"v1-5-pruned.ckpt\",\n",
" use_auth_token=True\n",
")"
]
},
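{
"cell_type": "code",
"execution_count": null,
"id": "added-path-check",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (an added sketch, not part of the original flow): hf_hub_download\n",
"# returns the cached path of the checkpoint it just fetched. If the bare \"v1-5-pruned.ckpt\"\n",
"# filename in the training command further below does not resolve on your pod, you can pass\n",
"# this full path to --actual_resume instead.\n",
"print(downloaded_model_path)"
]
},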
{
"cell_type": "markdown",
"id": "cf8a98c2",
"metadata": {},
"source": [
"## Make an input folder for your training images"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a886c9d",
"metadata": {},
"outputs": [],
"source": [
"!dir"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "91a3ff2f-3f80-45e3-a616-6e54edc9ff13",
"metadata": {},
"outputs": [],
"source": [
"!mkdir input #makes an input folder, UPLOAD YOUR TRAINING IMAGES THERE"
]
},
{
"cell_type": "markdown",
"id": "0bf1e8cd",
"metadata": {},
"source": [
"# Upload training files\n",
"\n",
"Ues the navigation on the left to upload your training files in the input folder. Use the File menu to upload files. You can upload multiple files at once. You can also upload multiple folders under the input folder if you want.\n",
"\n",
"You can check there are files in the folder by running the cell below (optional, just prints first 10)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fb380279-360f-4109-89ae-fb07767ab512",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"File Not Found\n",
"'head' is not recognized as an internal or external command,\n",
"operable program or batch file.\n"
]
}
],
"source": [
"!ls -U input | head -10"
]
},
{
"cell_type": "markdown",
"id": "873d9f3f",
"metadata": {},
"source": [
"## Tweak your YAML\n",
"You can adjust the YAML file to change the training parameters. \n",
"\n",
"Instructions are here: https://github.com/victorchall/EveryDream-trainer/blob/main/README.md\n",
"\n",
"[Runpod YAML](everydream-trainer/configs/stable-diffusion/v1-finetune_runpod.yaml) is a good starting point for small datasets (30-50 images) and is the default in the command below. It will only keep 2 checkpoints.\n",
"\n",
"[EveryDream YAML](workspace/everydream-trainer/configs/stable-diffusion/v1-finetune_everydream.yaml) is a good starting point for large datasets. You will need to change the filename in the --config parameter below to use this. This may create a LOT of large ckpt files while training, so make sure you have enough space in your runpod instance! 60GB+ is recommended."
]
},
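{
"cell_type": "code",
"execution_count": null,
"id": "added-view-config",
"metadata": {},
"outputs": [],
"source": [
"# Optional (an added convenience, not a required step): print the config you are about to\n",
"# train with so you can review repeats, batch size, and checkpointing before you start.\n",
"# You can also open and edit the YAML directly in the file browser on the left.\n",
"!cat configs/stable-diffusion/v1-finetune_runpod.yaml"
]
},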
{
"cell_type": "markdown",
"id": "25ea006b",
"metadata": {},
"source": [
"# Run the trainer\n",
"This will take a while. Make sure when it finishes you scroll down to run the last cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c12e7cf3-42be-4537-a4f7-5723c0248562",
"metadata": {},
"outputs": [],
"source": [
"# run the trainer, wait until it finishes then SCROLL DOWN to the next cell\n",
"!python main.py --base configs/stable-diffusion/v1-finetune_runpod.yaml -t --actual_resume \"v1-5-pruned.ckpt\" -n test --data_root input"
]
},
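{
"cell_type": "code",
"execution_count": null,
"id": "added-alt-train-commands",
"metadata": {},
"outputs": [],
"source": [
"# Optional variants of the training command above, left commented out as sketches:\n",
"# 1) Large-dataset config -- swap the --base YAML as described in the \"Tweak your YAML\" cell.\n",
"#!python main.py --base configs/stable-diffusion/v1-finetune_everydream.yaml -t --actual_resume \"v1-5-pruned.ckpt\" -n test --data_root input\n",
"\n",
"# 2) Start from a checkpoint you uploaded yourself instead of the HF download.\n",
"#    \"my-uploaded-model.ckpt\" is a placeholder; substitute the path of the file you uploaded.\n",
"#!python main.py --base configs/stable-diffusion/v1-finetune_runpod.yaml -t --actual_resume \"my-uploaded-model.ckpt\" -n test --data_root input"
]
},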
{
"cell_type": "markdown",
"id": "e8c93085",
"metadata": {},
"source": [
"## Prune your checkpoints\n",
"This will create 2GB pruned files for all your checkpoints. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e70ae7e",
"metadata": {},
"outputs": [],
"source": [
"# prune the ckpts\n",
"!python auto_prune_all.py --delete"
]
},
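{
"cell_type": "code",
"execution_count": null,
"id": "added-list-pruned",
"metadata": {},
"outputs": [],
"source": [
"# Optional helper (an added sketch): list the pruned ckpt files so you know exactly what to\n",
"# download in the next step. This just searches the current repo folder for \"-pruned\" ckpts.\n",
"!find . -name \"*pruned*.ckpt\""
]
},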
{
"cell_type": "markdown",
"id": "51456afe",
"metadata": {},
"source": [
"## Download your checkpoints\n",
"\n",
"Use the file explorer on the left, go into the \"every-dream-trainer\" folder.\n",
"\n",
"Look for all the ckpt files that say \"-pruned\" on the end. Download them and you're done! \n",
"\n",
"[EveryDream Discord](https://discord.gg/uheqxU6sXN)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.13 ('everydream')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "2e677f113ff5b533036843965d6e18980b635d0aedc1c5cebd058006c5afc92a"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -76,7 +76,7 @@ data:
     validation:
       target: ldm.data.ed_validate.EDValidateBatch
       params:
-        repeats: 0.4
+        repeats: 1
     test:
       target: ldm.data.ed_validate.EDValidateBatch
       params:

View File

@@ -66,7 +66,7 @@ data:
   target: main.DataModuleFromConfig
   params:
     batch_size: 4
-    num_workers: 8
+    num_workers: 1
     wrap: falsegit
     train:
       target: ldm.data.every_dream.EveryDreamBatch

View File

@@ -80,13 +80,13 @@ data:
     train:
       target: ldm.data.every_dream.EveryDreamBatch
       params:
-        repeats: 5
+        repeats: 10
         flip_p: 0
         debug_level: 1
     validation:
       target: ldm.data.ed_validate.EDValidateBatch
       params:
-        repeats: 0.1
+        repeats: 0.5
     test:
       target: ldm.data.ed_validate.EDValidateBatch
       params:
@@ -98,7 +98,7 @@ lightning:
       every_n_epochs: 1
       #every_n_train_steps: 1400 # can only use every_n_epochs OR every_n_train_steps, suggest you stick with epochs
       save_last: True
-      save_top_k: 2
+      save_top_k: 5
       filename: "{epoch:02d}-{step:05d}"
   callbacks:
     image_logger:
@@ -110,7 +110,7 @@ lightning:
   trainer:
     benchmark: True
-    max_epochs: 10
+    max_epochs: 8
     max_steps: 99000 # better to end on epochs not steps, especially with >500 images to ensure even distribution, but you can set this if you really want...
     check_val_every_n_epoch: 1
     gpus: 0,

BIN
demo/runpodconnect.png Normal file (binary file not shown; 52 KiB)

BIN
demo/runpodinstances.png Normal file (binary file not shown; 34 KiB)

BIN
demo/runpodopenurl.png Normal file (binary file not shown; 14 KiB)

BIN
demo/runpodsetup.png Normal file (binary file not shown; 45 KiB)

BIN
demo/runpodstop.png Normal file (binary file not shown; 56 KiB)

BIN
demo/runpodupload.png Normal file (binary file not shown; 50 KiB)

50
doc/RUNPOD.MD Normal file
View File

@@ -0,0 +1,50 @@
# Runpod
You will need your data prepared before starting! Don't waste rental fees if you're not ready to upload your files. Your files should be captioned before you start, either with the caption as the filename or in a text file alongside each image. See the main README.md for more details. Tools are available to automatically caption your files on a 4GB GPU or a Colab notebook.
[Main readme](https://github.com/victorchall/EveryDream-trainer/blob/main/README.md)
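As a rough sketch (the file names below are made up for illustration; see the main README for the exact caption conventions), a prepared input folder looks something like this:

    input/
        a portrait photo of john doe wearing a blue suit.jpg   <- caption as the filename
        vacation_photo_012.jpg
        vacation_photo_012.txt                                  <- or a caption in a text file alongside the image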
You can sign up for Runpod here (*shameless referral link*): [Runpod](https://runpod.io?ref=oko38cd0)
If you are confused by the wall of text, join the discord here: [EveryDream Discord](https://discord.gg/uheqxU6sXN)
Make sure you have at least 50GB of Runpod **Volume** storage so you don't end up with just one overtrained ckpt and have to start the training over. Penny-pinching on storage is ultimately a waste of your time and money! This is set up to save more than one ckpt so you don't overtrain.
## Getting started
1. Pick a 24GB GPU instance (community or secure). You can use 48GB, but it is unnecessary. Make sure to use "ON DEMAND", not "SPOT", or your instance may be closed suddenly and you will lose your training. Once you click, go to "My Pods".
![r](../demo/runpodinstances.png)
2. Start a PyTorch instance in Runpod. Make sure to get plenty of volume space!
![r](../demo/runpodsetup.png)
3. Launch the Jupyter notebook.
![r](../demo/runpodconnect.png)
4. Click File, then Open from URL, and paste in this URL: https://raw.githubusercontent.com/victorchall/EveryDream-trainer/main/Train-Runpod.ipynb
![r](../demo/runpodopenurl.png)
5. The rest of the instructions are in the notebook, but you'll upload your files here once you're ready:
![r](../demo/runpodupload.png)
6. When you are done (and after you have downloaded your pruned ckpt files), go back to Runpod.io, STOP your instance, and click the **trash can button** to remove the volume storage. You will be charged for the storage if you don't delete it.
![r](../demo/runpodstop.png)
# Advanced mode
You can train MUCH larger models with more data, combining potentially unlimited numbers of characters and styles into one model. You will follow the same steps, but the expectation is that you are using a much larger data set, say, many hundreds or thousands of images, and are willing to train for many hours. There are many things to consider on each project, and different projects have different requirements, so it is hard to generalize.
You can read up on a model trained with 7+ characters and a variety of cityscapes using 1600+ new training images and 1600+ preservation images here: [FF7R Mega Model on Huggingface](https://huggingface.co/panopstor/ff7r-stable-diffusion)
You will want even more volume space. 100GB is advised so you can keep MANY ckpts (many epochs) along the way, download them, stop your instance, test them, and resume from them again if needed. This will save you a lot of heartache with undertrained or overtrained models, especially if you are training for 6-10+ hours.
You will change the YAML in the training step to v1-finetune_everydream.yaml. You should also consider tweaking its values, giving the main README a long, careful read, and training a more basic model first.
!python main.py --base configs/stable-diffusion/v1-finetune_everydream.yaml -t --actual_resume "v1-5-pruned.ckpt" -n test --data_root input
This file is better configured for very large training sets; repeats is reduced.

0
input/.gitkeep Normal file
View File

View File

@@ -2,6 +2,8 @@
 import PIL
 import numpy as np
 from torchvision import transforms
+import random
+import math

 class ImageTrainItem(): # [image, identifier, target_aspect, closest_aspect_wh[w,h], pathname]
     def __init__(self, image: PIL.Image, caption: str, target_wh: list, pathname: str, flip_p=0.0):
@@ -9,6 +11,7 @@ class ImageTrainItem(): # [image, identifier, target_aspect, closest_aspect_wh[w
         self.target_wh = target_wh
         self.pathname = pathname
         self.flip = transforms.RandomHorizontalFlip(p=flip_p)
+        self.cropped_img = None

         if image is None:
             self.image = PIL.Image.new(mode='RGB',size=(1,1))
@@ -19,11 +22,45 @@ class ImageTrainItem(): # [image, identifier, target_aspect, closest_aspect_wh[w
         if type(self.image) is not np.ndarray:
             self.image = PIL.Image.open(self.pathname).convert('RGB')
-            self.image = self.image.resize((self.target_wh), PIL.Image.BICUBIC)
+            cropped_img = self.__autocrop(self.image)
+            self.image = cropped_img.resize((512,512), PIL.Image.BICUBIC)

         self.image = self.flip(self.image)
         self.image = np.array(self.image).astype(np.uint8)
         self.image = (self.image / 127.5 - 1.0).astype(np.float32)

         return self

+    @staticmethod
+    def __autocrop(image: PIL.Image, q=.404):
+        x, y = image.size
+
+        if x != y:
+            if (x>y):
+                rand_x = x-y
+                rand_y = 0
+                sigma = max(rand_x*q,1)
+            else:
+                rand_x = 0
+                rand_y = y-x
+                sigma = max(rand_y*q,1)
+
+            if (x>y):
+                x_crop_gauss = abs(random.gauss(0, sigma))
+                x_crop = min(x_crop_gauss,(x-y)/2)
+                x_crop = math.trunc(x_crop)
+                y_crop = 0
+            else:
+                y_crop_gauss = abs(random.gauss(0, sigma))
+                x_crop = 0
+                y_crop = min(y_crop_gauss,(y-x)/2)
+                y_crop = math.trunc(y_crop)
+
+            min_xy = min(x, y)
+            image = image.crop((x_crop, y_crop, x_crop + min_xy, y_crop + min_xy))
+
+        #print(f"crop: {x_crop} {y_crop}, {x} {y} => {image.size}")
+        return image