EveryDream2trainer/Train_Runpod.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2c831b5b-3025-4177-bef5-25aaec89573a",
   "metadata": {},
   "source": [
    "## Every Dream v2 RunPod Setup\n",
    "\n",
    "[General Instructions](https://github.com/victorchall/EveryDream2trainer/blob/main/README.md)\n",
    "\n",
    "If you can sign up for Runpod here (shameless referral link): [Runpod](https://runpod.io/?ref=oko38cd0)\n",
    "\n",
    "If you are confused by the wall of text, join the discord here: [EveryDream Discord](https://discord.gg/uheqxU6sXN)\n",
    "\n",
    "### Requirements\n",
    "Select the `RunPod Stable Diffusion v2.1` template. The `RunPod PyTorch` template does not work due to an old version of Python. \n",
    "\n",
    "#### Storage\n",
    "Remember, on RunPod time is more expensive than storage. \n",
    "\n",
    "Which is good, because running a lot of experiments can generate a lot of data. Not having the right save points to recover quickly from inevitable mistakes will cost you a lot of time.\n",
    "\n",
    "When in doubt, give yourself ~125GB of Runpod **Volume** storage.\n",
    "\n",
    "#### Preparation\n",
    "You will want to have your data prepared before starting, and have a rough training plan in mind. Don't waste rental fees if you're not fully prepared to start training.  \n",
    "\n",
    "Your files should be captioned before you start with either the caption as the filename or in text files for each image alongside the image files.  See [DATA.md](https://github.com/victorchall/EveryDream2trainer/blob/main/doc/DATA.md) for more details. Tools are available to automatically caption your files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0902e735",
   "metadata": {},
   "outputs": [],
   "source": [
    "# When running on a pod designed for Automatic 1111 \n",
    "# we need to kill the webui process to free up mem for training\n",
    "!ps x | grep -E \"(relauncher|webui)\" | grep -v \"grep\" | awk '{print $1}' | xargs kill $1\n",
    "\n",
    "# check system resources, make sure your GPU actually has 24GB\n",
    "# You should see something like \"0 MB / 24576 MB\" in the middle of the printout\n",
    "# if you see 0 MB / 22000 MB pick a beefier instance...\n",
    "!grep Swap /proc/meminfo\n",
    "!swapon -s\n",
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb6d14b7-3c37-4ec4-8559-16b4e9b8dd18",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "%cd /workspace\n",
    "if not os.path.exists(\"EveryDream2trainer\"):\n",
    "    !git clone https://github.com/victorchall/EveryDream2trainer\n",
    "\n",
    "%cd EveryDream2trainer\n",
    "!python utils/get_yamls.py"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0bf1e8cd",
   "metadata": {},
   "source": [
    "# Upload training files\n",
    "\n",
    "Ues the navigation on the left to open the ** \"workspace / EveryDream2trainer / input\"** and upload your training files using the **up arrow button** above the file explorer, or by dragging and dropping the files from your local machine onto the file explorer.\n",
    "\n",
    "If you have many training files, or nested folders of training data, create a zip archive of your training data, upload this file to the input folder, then right click on the zip file and select \"Extract Archive\".\n",
    "\n",
    "**While your training data is uploading, feel free to install the dependencies below**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "589bfca0",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Install dependencies\n",
    "\n",
    "**This will take up to 15 minutes (if building xformers).  Wait until it says \"DONE\" to move on.** \n",
    "You can ignore \"warnings.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9649a02c-fb2b-44f1-842d-d1662fa5c7cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m pip install --upgrade pip\n",
    "\n",
    "!pip install requests==2.25.1\n",
    "!pip install -U -I torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url \"https://download.pytorch.org/whl/cu116\"\n",
    "!pip install transformers==4.25.1\n",
    "!pip install -U diffusers[torch]==0.10.2\n",
    "\n",
    "!pip install pynvml==11.4.1\n",
    "!pip install bitsandbytes==0.35.0\n",
    "!pip install ftfy==6.1.1\n",
    "!pip install aiohttp==3.8.3\n",
    "!pip install \"tensorboard>=2.11.0\"\n",
    "!pip install protobuf==3.20.2\n",
    "!pip install wandb==0.13.6\n",
    "!pip install colorama==0.4.6\n",
    "!pip install -U triton\n",
    "!pip install -U ninja\n",
    "\n",
    "from subprocess import getoutput\n",
    "s = getoutput('nvidia-smi')\n",
    "\n",
    "if \"A100\" in s:\n",
    "    print(\"Detected A100, installing stable xformers\")\n",
    "    !pip install -U xformers\n",
    "else:\n",
    "    # A5000/3090/4090 support requires us to build xformers ourselves for now\n",
    "    print(\"Building xformers for SM86\")\n",
    "    !apt-get update && apt-get install -y gcc g++\n",
    "    !export TORCH_CUDA_ARCH_LIST=8.6 && pip install git+https://github.com/facebookresearch/xformers.git@48a77cc#egg=xformers\n",
    "\n",
    "print(\"DONE\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36105dbc-5a33-431b-b88e-b87d479d1ed7",
   "metadata": {},
   "source": [
    "\n",
    "# Optional - Configure sample prompts\n",
    "You can set your own sample prompts by adding them, one line at a time, to sample_prompts.txt.\n",
    "\n",
    "Keep in mind a longer list of prompts will take longer to generate. You may also want to adjust you sample_steps in the training notebook to a different value to get samples left often. This is probably a good idea when training a larger dataset that you know will take longer to train, where more frequent samples will not help you."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c230d91a",
   "metadata": {},
   "source": [
    "## Now that dependencies are installed, ready to move on!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "176af7b7-ebfe-4d25-a4a2-5c03489590ab",
   "metadata": {},
   "source": [
    "## Log into huggingface\n",
    "Run the cell below and paste your token into the prompt.  You can get your token from your [huggingface account page](https://huggingface.co/settings/tokens).\n",
    "\n",
    "The token will not show on the screen, just press enter after you paste it.\n",
    "\n",
    "Then run the following cell to download the base checkpoint (may take a minute)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a1d04ff-8a2c-46c6-a5de-baea1b3e5a2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import notebook_login, hf_hub_download\n",
    "import os\n",
    "notebook_login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86b66fe4-c2ca-46fa-813c-8fe390813add",
   "metadata": {},
   "outputs": [],
   "source": [
    "repo=\"panopstor/EveryDream\"\n",
    "ckpt_file=\"sd_v1-5_vae.ckpt\"\n",
    "\n",
    "print(f\"Downloading {ckpt_file} from {repo}\")\n",
    "downloaded_model_path = hf_hub_download(repo, ckpt_file, cache_dir=\"/workspace/hfcache\")\n",
    "ckpt_name = os.path.splitext(os.path.basename(downloaded_model_path))[0]\n",
    "print(f\"Downloaded {ckpt_name} to {downloaded_model_path}\")\n",
    "\n",
    "if not os.path.exists(f\"ckpt_cache/{ckpt_name}\"):\n",
    "    print(f\"Converting {ckpt_name} to Diffusers format\")\n",
    "    !python utils/convert_original_stable_diffusion_to_diffusers.py --scheduler_type ddim \\\n",
    "    --original_config_file v1-inference.yaml \\\n",
    "    --image_size 512 \\\n",
    "    --checkpoint_path \"{downloaded_model_path}\" \\\n",
    "    --prediction_type epsilon \\\n",
    "    --upcast_attn False \\\n",
    "    --dump_path \"ckpt_cache/{ckpt_name}\"\n",
    "\n",
    "\n",
    "print(\"DONE\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f15fcd56-0418-4be1-a5c3-38aa679b1aaf",
   "metadata": {},
   "source": [
    "# Start Training\n",
    "Naming your project will help you track what the heck you're doing when you're floating in checkpoint files later.\n",
    "\n",
    "You may wish to consider adding \"sd1\" or \"sd2v\" or similar to remember what the base was, as you'll also have to tell your inference app what you were using, as its difficult for programs to know what inference YAML to use automatically. For instance, Automatic1111 webui requires you to copy the v2 inference YAML and rename it to match your checkpoint name so it knows how to load the file, tough it assumes SD 1.x compatible. Something to keep in mind if you start training on SD2.1.\n",
    "\n",
    "`max_epochs`, `sample_steps`, and `save_every_n_epochs` should be tuned to your dataset. I like to generate one or two sets of samples per save, and aim for 5 (give or take 2) saved checkpoints.\n",
    "\n",
    "Next cell runs training. This will take a while depending on your number of images, repeats, and max_epochs.\n",
    "\n",
    "You can watch for test images in the logs folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f73fb86-ebef-41e2-9382-4aa11be84be6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python train.py --project_name \"ft_v1a_512_15e7\" \\\n",
    "--resume_ckpt \"{ckpt_name}\" \\\n",
    "--data_root \"input\" \\\n",
    "--resolution 512 \\\n",
    "--batch_size 4 \\\n",
    "--max_epochs 200 \\\n",
    "--save_every_n_epochs 25 \\\n",
    "--lr 1.5e-6 \\\n",
    "--lr_scheduler constant \\\n",
    "--sample_steps 50 \\\n",
    "--useadam8bit \\\n",
    "--save_full_precision \\\n",
    "--shuffle_tags \\\n",
    "--amp \\\n",
    "--write_schedule"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca170883-2d18-4f34-adbb-aca5824a351b",
   "metadata": {},
   "source": [
    "## You can chain togther a tuning schedule using `--resume_ckpt findlast`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f39291ef-bea9-471e-bbc7-7b681fa87eb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python train.py --project_name \"ft_v1b_512_07e7\" \\\n",
    "--resume_ckpt findlast \\\n",
    "--data_root \"input\" \\\n",
    "--resolution 512 \\\n",
    "--batch_size 4 \\\n",
    "--max_epochs 50 \\\n",
    "--save_every_n_epochs 25 \\\n",
    "--lr 0.7e-6 \\\n",
    "--lr_scheduler constant \\\n",
    "--sample_steps 50 \\\n",
    "--useadam8bit \\\n",
    "--save_full_precision \\\n",
    "--shuffle_tags \\\n",
    "--amp \\\n",
    "--write_schedule\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f24eee3d-f5df-45f3-9acc-ee0206cfe6b1",
   "metadata": {},
   "source": [
    "# HuggingFace upload\n",
    "Use the cell below to upload your checkpoint to your personal HuggingFace account if you want instead of manually downloading. You should already be authorized to Huggingface by token if you used the download/token cells above.\n",
    "\n",
    "Make sure to fill in the three fields at the top. This will only upload one file at a time, so you will need to run the cell below for each file you want to upload.\n",
    "\n",
    "* You can get your account name from your [HuggingFace account page](https://huggingface.co/settings/account). Look for your \"username\" field and paste it below.\n",
    "\n",
    "* You only need to setup a repository one time.  You can create it here: [Create New HF Model](https://huggingface.co/new)  Make sure you write down the repo name you make for future use.  You can reuse it later.\n",
    "\n",
    "* You need to type the name of the ckpts one at a time in the cell below, find them in the left file drawer of your Runpod/Vast/Colab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d5e8bd2-3e9f-4196-ad7a-ae6dcfc84b92",
   "metadata": {},
   "outputs": [],
   "source": [
    "#list ckpts in root that are ready for download\n",
    "import glob\n",
    "\n",
    "for f in glob.glob(\"*.ckpt\"):\n",
    "    print(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9df5e1a-3c68-41c0-a4ed-ea0abcd19858",
   "metadata": {},
   "outputs": [],
   "source": [
    "!huggingface-cli lfs-enable-largefiles\n",
    "# fill in these three fields:\n",
    "hfusername = \"MyHfUser\"\n",
    "reponame = \"MyRepo\"\n",
    "ckpt_name = \"ft_v1b_512_15e7-ep200-gs02500.ckpt\"\n",
    "\n",
    "\n",
    "target_name = ckpt_name.replace('-','').replace('=','')\n",
    "import os\n",
    "os.rename(ckpt_name,target_name)\n",
    "repo_id=f\"{hfusername}/{reponame}\"\n",
    "print(f\"uploading to HF: {repo_id}/{ckpt_name}\")\n",
    "print(\"this make take a while...\")\n",
    "\n",
    "from huggingface_hub import HfApi\n",
    "api = HfApi()\n",
    "response = api.upload_file(\n",
    "    path_or_fileobj=target_name,\n",
    "    path_in_repo=target_name,\n",
    "    repo_id=repo_id,\n",
    "    repo_type=None,\n",
    "    create_pr=1,\n",
    ")\n",
    "print(response)\n",
    "print(finish_msg)\n",
    "print(\"go to your repo and accept the PR this created to see your file\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "2e677f113ff5b533036843965d6e18980b635d0aedc1c5cebd058006c5afc92a"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}