Split Train_Runpod.ipynb

Into RunPod-specific installer and generic training notebook.

This should make it easier to support other providers
This commit is contained in:
Augusto de la Torre 2023-02-09 21:18:32 +01:00
parent b73c135734
commit b7afed5391
2 changed files with 264 additions and 169 deletions

Train_Runpod.ipynb

@@ -5,176 +5,77 @@
"id": "2c831b5b-3025-4177-bef5-25aaec89573a",
"metadata": {},
"source": [
"## Every Dream v2 RunPod Setup\n",
"## Every Dream v2 Jupyter Notebook\n",
"\n",
"[General Instructions](https://github.com/victorchall/EveryDream2trainer/blob/main/README.md)\n",
"### [General Instructions](https://github.com/victorchall/EveryDream2trainer/blob/main/README.md)\n",
"\n",
"If you can sign up for Runpod here (shameless referral link): [Runpod](https://runpod.io/?ref=oko38cd0)\n",
"### What's your plan?\n",
"You will want to have your data prepared before starting, and have a rough training plan in mind. \n",
"\n",
"If you are confused by the wall of text, join the discord here: [EveryDream Discord](https://discord.gg/uheqxU6sXN)\n",
"**Make sure your images are captioned!**\n",
"\n",
"### Usage\n",
"\n",
"1. Prepare your training data before you begin (see below)\n",
"2. Spin the `RunPod Stable Diffusion v2.1` template. The `RunPod PyTorch` template does not work due to an old version of Python. \n",
"3. Open this notebook with `File > Open from URL...` pointing to `https://raw.githubusercontent.com/victorchall/EveryDream2trainer/main/Train_Runpod.ipynb`\n",
"4. Run each cell below once, noting any instructions above the cell (the first step requires a pod restart)\n",
"5. Figure out how you want to tweak the process next\n",
"6. Rinse, Repeat\n",
"\n",
"#### A note on storage\n",
"Remember, on RunPod time is more expensive than storage. \n",
"\n",
"Which is good, because running a lot of experiments can generate a lot of data. Not having the right save points to recover quickly from inevitable mistakes will cost you a lot of time.\n",
"\n",
"When in doubt, give yourself ~125GB of Runpod **Volume** storage.\n",
"\n",
"#### Preparing your training data\n",
"You will want to have your data prepared before starting, and have a rough training plan in mind. Don't waste rental fees if you're not fully prepared to start training. \n",
"By default the name of your image files are assumed to be captions. If you want to get fancy, there are [more sophisticated techniques](https://github.com/victorchall/EveryDream2trainer/blob/main/doc/DATA.md)\n",
"\n",
"**If this is your first time trying a full fine-tune, start small!** \n",
"Pick a single concept and 30-100 images, and see what happens. Training a small dataset like this is fast, and will give you a feel for how quickly your model (over-)trains depending on your training schedule.\n",
"\n",
"Your files should be captioned before you start with either the caption as the filename or in text files for each image alongside the image files. See [DATA.md](https://github.com/victorchall/EveryDream2trainer/blob/main/doc/DATA.md) for more details. Tools are available to automatically caption your files."
"Pick a single concept and 30-100 images, and see what happens. \n",
"\n",
"Training a small dataset like this is fast, and will give you a feel for how quickly your model (over-)trains depending on your training schedule, captioning schema, knob twiddling. This notebook provides some sensible defaults, there are more questions than answers in how best to fine tune anything. \n",
"\n",
"**_When_ you have questions...**\n",
"\n",
"Come visit us at [EveryDream Discord](https://discord.gg/uheqxU6sXN)"
]
},
{
"cell_type": "markdown",
"id": "5123d4e6-281c-4475-99fd-328f4d5df734",
"id": "7c73894e-3b5e-4268-9f83-ed89bd4569f2",
"metadata": {},
"source": [
"# For best results, restart the pod after the next cell completes\n",
"### (Optional) Weights and Biases login. \n",
"\n",
"Here we ensure that EveryDream2trainer is installed, and we disable the Automatic 1111 web-ui. But the vram consumed by the web-ui will not be fully freed until the pod restarts. This is especially important if you are training with large batch sizes."
"Paste your token here if you have an account so you can use it to track your training progress. If you don't have an account, you can create one for free at https://wandb.ai/site. Log will use your project name from above. This is a free online logging utility.\n",
"\n",
"Your key is on this page: https://wandb.ai/settings under \"Danger Zone\" \"API Keys\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb6d14b7-3c37-4ec4-8559-16b4e9b8dd18",
"id": "cdbaf48c-f1e2-458d-b1ee-707f3b71bf61",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from ipywidgets import *\n",
"\n",
"%cd /workspace\n",
"!echo pass > /workspace/stable-diffusion-webui/relauncher.py\n",
"if not os.path.exists(\"EveryDream2trainer\"):\n",
" !git clone https://github.com/victorchall/EveryDream2trainer\n",
"\n",
"%cd EveryDream2trainer\n",
"%mkdir input\n",
"!python utils/get_yamls.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0902e735",
"metadata": {},
"outputs": [],
"source": [
"# When running on a pod designed for Automatic 1111 \n",
"# we need to kill the webui process to free up mem for training\n",
"!ps x | grep -E \"(relauncher|webui)\" | awk '{print $1}' | xargs kill $1\n",
"\n",
"# check system resources, make sure your GPU actually has 24GB\n",
"# You should see something like \"0 MB / 24576 MB\" in the middle of the printout\n",
"# if you see 0 MB / 22000 MB pick a beefier instance...\n",
"!grep Swap /proc/meminfo\n",
"!swapon -s\n",
"!nvidia-smi"
"wandb_token = Password(placeholder=\"Optional Weights & Biases auth token\")\n",
"out = Output()\n",
"def wandb_login(_):\n",
" with out:\n",
" if wandb_token.value:\n",
" !wandb login {wandb_token.value}\n",
" \n",
"wandb_btn = Button(description=\"W&B Login\")\n",
"wandb_btn.on_click(wandb_login)\n",
"print()\n",
"display(VBox([wandb_token, wandb_btn, out]))"
]
},
{
"cell_type": "markdown",
"id": "0bf1e8cd",
"id": "3d9b0db8-c2b1-4f0a-b835-b6b2ef527019",
"metadata": {},
"source": [
"# Upload training files\n",
"\n",
"Ues the navigation on the left to open the ** \"workspace / EveryDream2trainer / input\"** and upload your training files using the **up arrow button** above the file explorer, or by dragging and dropping the files from your local machine onto the file explorer.\n",
"\n",
"If you have many training files, or nested folders of training data, create a zip archive of your training data, upload this file to the input folder, then right click on the zip file and select \"Extract Archive\".\n",
"\n",
"## Optional - Configure sample prompts\n",
"You can set your own sample prompts by adding them, one line at a time, to sample_prompts.txt.\n",
"\n",
"Keep in mind a longer list of prompts will take longer to generate. You may also want to adjust you sample_steps in the training notebook to a different value to get samples left often. This is probably a good idea when training a larger dataset that you know will take longer to train, where more frequent samples will not help you.\n",
"\n",
"While your training data is uploading, go ahead to install the dependencies below\n",
"----"
]
},
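If the right-click "Extract Archive" option is fiddly, the zip can also be unpacked from a code cell. A minimal sketch, assuming a hypothetical archive name `training_data.zip` uploaded into `input/`:

```python
import zipfile
from pathlib import Path

# Hypothetical archive name; use whatever you actually uploaded.
archive = Path("/workspace/EveryDream2trainer/input/training_data.zip")

with zipfile.ZipFile(archive) as zf:
    zf.extractall(archive.parent)  # unpack into input/ alongside the zip

archive.unlink()  # optional: delete the zip once extracted
```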
{
"cell_type": "markdown",
"id": "589bfca0",
"metadata": {
"tags": []
},
"source": [
"## Install dependencies\n",
"\n",
"**This will take up to 15 minutes (if building xformers). Wait until it says \"DONE\" to move on.** \n",
"You can ignore \"warnings.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9649a02c-fb2b-44f1-842d-d1662fa5c7cd",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"!python -m pip install --upgrade pip\n",
"\n",
"!pip install requests==2.25.1\n",
"!pip install -U -I torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url \"https://download.pytorch.org/whl/cu117\"\n",
"!pip install transformers==4.25.1\n",
"!pip install -U diffusers[torch]\n",
"\n",
"!pip install pynvml==11.4.1\n",
"!pip install bitsandbytes==0.35.0\n",
"!pip install ftfy==6.1.1\n",
"!pip install aiohttp==3.8.3\n",
"!pip install \"tensorboard>=2.11.0\"\n",
"!pip install protobuf==3.20.2\n",
"!pip install wandb==0.13.6\n",
"!pip install colorama==0.4.6\n",
"!pip install -U triton\n",
"!pip install --pre -U xformers\n",
" \n",
"print(\"DONE\")"
]
},
{
"cell_type": "markdown",
"id": "c230d91a",
"metadata": {},
"source": [
"## Now that dependencies are installed, ready to move on!"
]
},
{
"cell_type": "markdown",
"id": "176af7b7-ebfe-4d25-a4a2-5c03489590ab",
"metadata": {},
"source": [
"## Log into huggingface\n",
"### HuggingFace Login\n",
"Run the cell below and paste your token into the prompt. You can get your token from your [huggingface account page](https://huggingface.co/settings/tokens).\n",
"\n",
"The token will not show on the screen, just press enter after you paste it.\n",
"\n",
"Then run the following cell to download the base checkpoint (may take a minute)."
"The token will not show on the screen, just press enter after you paste it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a1d04ff-8a2c-46c6-a5de-baea1b3e5a2b",
"id": "138b7776-8783-4e1d-920d-cf358809b802",
"metadata": {},
"outputs": [],
"source": [
@@ -183,6 +84,14 @@
"notebook_login()"
]
},
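If the interactive prompt is inconvenient (for example when re-running the notebook top to bottom), `huggingface_hub` can also log in directly with a token string; a sketch, with a placeholder token:

```python
from huggingface_hub import login

# Placeholder token; get yours from https://huggingface.co/settings/tokens
# and never commit a real token to a shared notebook.
login(token="hf_xxxxxxxxxxxxxxxxxxxx")
```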
{
"cell_type": "markdown",
"id": "7a96f2af-8c93-4460-aa9e-2ff795fb06ea",
"metadata": {},
"source": [
"#### Then run the following cell to download the base checkpoint (may take a minute)."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -193,7 +102,6 @@
},
"outputs": [],
"source": [
"%cd /workspace/EveryDream2trainer\n",
"repo=\"panopstor/EveryDream\"\n",
"ckpt_file=\"sd_v1-5_vae.ckpt\"\n",
"\n",
@@ -204,7 +112,7 @@
"\n",
"if not os.path.exists(f\"ckpt_cache/{ckpt_name}\"):\n",
" print(f\"Converting {ckpt_name} to Diffusers format\")\n",
" !python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim \\\n",
" %run utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim \\\n",
" --original_config_file v1-inference.yaml \\\n",
" --image_size 512 \\\n",
" --checkpoint_path \"{downloaded_model_path}\" \\\n",
@@ -212,7 +120,6 @@
" --upcast_attn False \\\n",
" --dump_path \"ckpt_cache/{ckpt_name}\"\n",
"\n",
"\n",
"print(\"DONE\")"
]
},
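The download step itself is elided by the diff; presumably it resolves `downloaded_model_path` from the `repo`/`ckpt_file` pair. A sketch of one way to do that with `huggingface_hub` (the call is an assumption; only the variable names come from the visible cell):

```python
from huggingface_hub import hf_hub_download

repo = "panopstor/EveryDream"
ckpt_file = "sd_v1-5_vae.ckpt"

# Resolves the file into the local HF cache and returns its path.
downloaded_model_path = hf_hub_download(repo_id=repo, filename=ckpt_file)
print(downloaded_model_path)
```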
@@ -243,38 +150,35 @@
},
"outputs": [],
"source": [
"%cd /workspace/EveryDream2trainer\n",
"!python train.py --project_name \"sd1_mymodel_000\" \\\n",
"--resume_ckpt \"sd_v1-5_vae\" \\\n",
"%run train.py --config train.json {wandb} \\\n",
"--resume_ckpt \"{ckpt_name}\" \\\n",
"--project_name \"sd1_mymodel\" \\\n",
"--data_root \"input\" \\\n",
"--resolution 512 \\\n",
"--batch_size 8 \\\n",
"--max_epochs 100 \\\n",
"--save_every_n_epochs 50 \\\n",
"--lr 1.8e-6 \\\n",
"--lr_scheduler cosine \\\n",
"--sample_steps 250 \\\n",
"--useadam8bit \\\n",
"--save_full_precision \\\n",
"--shuffle_tags \\\n",
"--amp \\\n",
"--write_schedule\n",
"\n",
"!python train.py --project_name \"sd1_mymodel_100\" \\\n",
"--resume_ckpt \"findlast\" \\\n",
"--data_root \"input\" \\\n",
"--resolution 512 \\\n",
"--batch_size 4 \\\n",
"--max_epochs 100 \\\n",
"--save_every_n_epochs 20 \\\n",
"--lr 1.0e-6 \\\n",
"--lr_scheduler constant \\\n",
"--max_epochs 200 \\\n",
"--sample_steps 150 \\\n",
"--useadam8bit \\\n",
"--save_full_precision \\\n",
"--shuffle_tags \\\n",
"--amp \\\n",
"--write_schedule"
"--save_every_n_epochs 35 \\\n",
"--lr 1.5e-6 \\\n",
"--lr_scheduler constant"
]
},
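Since the cell passes `--config train.json` and then overrides individual flags, the stable parts of a run can live in the JSON file itself. A sketch that generates such a file, assuming the JSON keys mirror the CLI flag names shown above (the actual train.json schema is not part of this diff):

```python
import json

# Assumed schema: keys mirror the train.py CLI flags used above.
config = {
    "resolution": 512,
    "batch_size": 8,
    "max_epochs": 100,
    "save_every_n_epochs": 35,
    "lr": 1.5e-6,
    "lr_scheduler": "constant",
    "sample_steps": 250,
    "useadam8bit": True,
    "save_full_precision": True,
    "shuffle_tags": True,
    "amp": True,
}

with open("train.json", "w") as f:
    json.dump(config, f, indent=2)
```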
{
"cell_type": "markdown",
"id": "ed464c6b-1a8d-48e4-9787-265e8acaac43",
"metadata": {},
"source": [
"### Optionally you can chain trainings together using multiple configurations combined with `resume_ckpt: findlast`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "492350d4-9b2f-4d2a-9641-1f723125b296",
"metadata": {},
"outputs": [],
"source": [
"%run train.py --config chain0.json --project_name \"sd1_chain_a\" --data_root \"input\" --resume_ckpt \"{ckpt_name}\"\n",
"%run train.py --config chain1.json --project_name \"sd1_chain_b\" --data_root \"input\" --resume_ckpt findlast\n",
"%run train.py --config chain2.json --project_name \"sd1_chain_c\" --data_root \"input\" --resume_ckpt findlast"
]
},
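One way to produce the three chain configs is to write a shared base plus per-phase overrides, for example tapering the learning rate as training progresses; a sketch with hypothetical values:

```python
import json

# Hypothetical phase schedule: taper the learning rate and save more
# often as the model converges. Adjust to taste.
base = {"resolution": 512, "batch_size": 8, "lr_scheduler": "constant"}
phases = [
    {"max_epochs": 100, "lr": 1.5e-6, "save_every_n_epochs": 35},
    {"max_epochs": 100, "lr": 1.0e-6, "save_every_n_epochs": 20},
    {"max_epochs": 50,  "lr": 5.0e-7, "save_every_n_epochs": 10},
]

for i, overrides in enumerate(phases):
    with open(f"chain{i}.json", "w") as f:
        json.dump({**base, **overrides}, f, indent=2)
```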
{
@@ -309,7 +213,7 @@
"hfrepo = Text(placeholder='Your HF repo name')\n",
"\n",
"api = HfApi()\n",
"upload_btn = Button(description='Upload', layout=full_width)\n",
"upload_btn = Button(description='Upload')\n",
"out = Output()\n",
"\n",
"def upload_ckpts(_):\n",
@@ -349,7 +253,6 @@
"metadata": {},
"outputs": [],
"source": [
"%cd /workspace/EveryDream2trainer\n",
"from ipywidgets import *\n",
"from IPython.display import display, clear_output\n",
"import os\n",

installers/Runpod.ipynb Normal file

@@ -0,0 +1,192 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2c831b5b-3025-4177-bef5-25aaec89573a",
"metadata": {},
"source": [
"## Every Dream v2 RunPod Installer\n",
"\n",
"[General Instructions](https://github.com/victorchall/EveryDream2trainer/blob/main/README.md)\n",
"\n",
"You can sign up for Runpod here (shameless referral link): [Runpod](https://runpod.io/?ref=oko38cd0)\n",
"\n",
"### Usage\n",
"\n",
"1. Prepare your training data before you begin (see below)\n",
"2. Spin the `RunPod Stable Diffusion v2.1` template. The `RunPod PyTorch` template does not work due to an old version of Python. \n",
"3. Open this notebook with `File > Open from URL...` pointing to `https://raw.githubusercontent.com/victorchall/EveryDream2trainer/main/installers/Runpod.ipynb`\n",
"4. Run each cell below once, noting any instructions above the cell (the first step requires a pod restart)\n",
"5. Figure out how you want to tweak the process next\n",
"6. Rinse, Repeat\n",
"\n",
"#### A note on storage\n",
"\n",
"Remember, on RunPod time is more expensive than storage. \n",
"\n",
"Which is good, because running a lot of experiments can generate a lot of data. Not having the right save points to recover quickly from inevitable mistakes will cost you a lot of time.\n",
"\n",
"When in doubt, give yourself ~125GB of Runpod **Volume** storage.\n",
"\n",
"#### Are you ready?\n",
"\n",
"You will want to have your data prepared before starting, and have a rough training plan in mind. \n",
"\n",
"**Don't waste rental fees if you're not fully prepared to start training.**"
]
},
{
"cell_type": "markdown",
"id": "9cc4250a-bd89-4623-a188-7bb9fd3b99ec",
"metadata": {},
"source": [
"## Install EveryDream"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb6d14b7-3c37-4ec4-8559-16b4e9b8dd18",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"%cd /workspace\n",
"\n",
"if not os.path.exists(\"EveryDream2trainer\"):\n",
" !git clone https://github.com/victorchall/EveryDream2trainer\n",
"\n",
"%cd EveryDream2trainer\n",
"%mkdir input\n",
"%run utils/get_yamls.py\n",
"\n",
"!echo pass > /workspace/stable-diffusion-webui/relauncher.py"
]
},
{
"cell_type": "markdown",
"id": "5123d4e6-281c-4475-99fd-328f4d5df734",
"metadata": {},
"source": [
"### Check your VRAM\n",
"If you see `22000 MB` or lower, then trash your pod and pick an A5000/3090 or better pod next time\n",
"\n",
"If you see `24576 MB` or higher you are good to go, but notice that there are `3500 MB` being taken up by Automatic 1111.\n",
"\n",
"Simply killing the web-ui won't free up that VRAM, but fortunately we added a hack to disable it above.\n",
"\n",
"Unfortunately it will require a pod restart once everything is installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0902e735",
"metadata": {},
"outputs": [],
"source": [
"!grep Swap /proc/meminfo\n",
"!swapon -s\n",
"!nvidia-smi"
]
},
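Once `pynvml` is installed (it is in the dependency list below), the same check can be done programmatically, which lets the notebook fail fast on an undersized GPU; a minimal sketch:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

total_mb = mem.total // (1024 * 1024)
used_mb = mem.used // (1024 * 1024)
print(f"{used_mb} MB / {total_mb} MB used")

# A ~22000 MB total means a 22GB card: pick a 24GB (A5000/3090) pod instead.
assert total_mb >= 24000, "GPU has less than 24GB of VRAM"
pynvml.nvmlShutdown()
```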
{
"cell_type": "markdown",
"id": "0bf1e8cd",
"metadata": {},
"source": [
"## Upload training files\n",
"\n",
"Ues the navigation on the left to open the ** \"workspace / EveryDream2trainer / input\"** and upload your training files using the **up arrow button** above the file explorer, or by dragging and dropping the files from your local machine onto the file explorer.\n",
"\n",
"If you have many training files, or nested folders of training data, create a zip archive of your training data, upload this file to the input folder, then right click on the zip file and select \"Extract Archive\".\n",
"\n",
"### Optional - Configure sample prompts\n",
"You can set your own sample prompts by adding them, one line at a time, to sample_prompts.txt.\n",
"\n",
"Keep in mind a longer list of prompts will take longer to generate. You may also want to adjust you sample_steps in the training notebook to a different value to get samples left often. This is probably a good idea when training a larger dataset that you know will take longer to train, where more frequent samples will not help you.\n",
"\n",
"## While your training data is uploading, go ahead to install the dependencies below\n",
"**This will a few minutes. Wait until it says \"DONE\" to move on.** \n",
"You can ignore \"warnings.\""
]
},
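sample_prompts.txt is just one prompt per line, so it can also be written from a code cell; a sketch with placeholder prompts:

```python
from pathlib import Path

# Hypothetical prompts; replace with ones relevant to your own dataset.
prompts = [
    "a portrait photo of my_subject, studio lighting",
    "my_subject riding a bicycle through a city street",
]

Path("sample_prompts.txt").write_text("\n".join(prompts) + "\n")
```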
{
"cell_type": "code",
"execution_count": null,
"id": "9649a02c-fb2b-44f1-842d-d1662fa5c7cd",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"!python -m pip install --upgrade pip\n",
"\n",
"!pip install requests==2.25.1\n",
"!pip install -U -I torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url \"https://download.pytorch.org/whl/cu117\"\n",
"!pip install transformers==4.25.1\n",
"!pip install -U diffusers[torch]\n",
"\n",
"!pip install pynvml==11.4.1\n",
"!pip install bitsandbytes==0.35.0\n",
"!pip install ftfy==6.1.1\n",
"!pip install aiohttp==3.8.3\n",
"!pip install \"tensorboard>=2.11.0\"\n",
"!pip install protobuf==3.20.2\n",
"!pip install wandb==0.13.6\n",
"!pip install colorama==0.4.5\n",
"!pip install -U triton\n",
"!pip install --pre -U xformers\n",
" \n",
"print(\"DONE\")"
]
},
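After the installs finish (and before restarting the pod), a quick sanity check confirms the CUDA build of torch and xformers import cleanly; a minimal sketch:

```python
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import xformers
    print("xformers", xformers.__version__)
except ImportError as err:
    print("xformers failed to import:", err)
```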
{
"cell_type": "markdown",
"id": "0889cec2-241e-4323-8463-23bd41ece7a3",
"metadata": {},
"source": [
"## RESTART (not reset) your pod now\n",
"The A1111 web ui will no longer load, and we will free up the rest of that VRAM. \n",
"\n",
"**_After restarting, reload_** this page and head on over to [EveryDream2trainer/Train_JupyterLab.ipynb](EveryDream2trainer/Train_JupyterLab.ipynb) to start training!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8ba508f-7cf4-4f41-9d4d-2cf9975e6774",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
},
"vscode": {
"interpreter": {
"hash": "2e677f113ff5b533036843965d6e18980b635d0aedc1c5cebd058006c5afc92a"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}