more doc cleanup

2023-01-23 15:47:07 -05:00 · 2023-01-23 15:47:07 -05:00 · 8c9e7f1db8
parent 4a5ddd0cd1
commit 8c9e7f1db8
1 changed files with 32 additions and 13 deletions
--- a/doc/TWEAKING.md
+++ b/doc/TWEAKING.md
@ -1,14 +1,14 @@
-# Tweaking settings
+# __Tweaking settings__

 This document should be read by all users who are trying to get the best results out of EveryDream 2.0.  These are the key settings you'll need to understand to get started.

-## Logging
+## __Logging__

 Make sure you pay attention to your logs and sample images.  Launch tensorboard in a second command line. See [logging](LOGGING.md) for more info.

    tensorboard --logdir logs

-## Project name
+## __Project name__

 Naming your project will help you track what the heck you're doing when you're floating in checkpoint files later.

@ -17,17 +17,22 @@ You may wish to consider adding "sd1" or "sd2v" or similar to remember what the
    --project_name "jets_sd21768v" ^


-## Stuff you probably want on
+## __Stuff you probably want on__

+### amp
    --amp

-Enables automatic mixed precision.  Greatly improved training speed and can help a bit with VRAM use.  [Torch](https://pytorch.org/docs/stable/amp.html) will automatically use FP16 precision for specific model components where FP16 is sufficient precision, and FP32 otherwise.  This also enables xformers to work with the SD1.x attention head schema.
+Enables automatic mixed precision.  Greatly improved training speed and will reduce VRAM use.  [Torch](https://pytorch.org/docs/stable/amp.html) will automatically use FP16 precision for specific model components where FP16 is sufficient precision, and FP32 otherwise.  This also enables xformers to work with the SD1.x attention head schema, which is a large speed boost for SD1.x training.  I highly suggest you always use this, but it is left as an option if you wish to disable.
+
+When amp is used with [gradient checkpointing](#gradient_checkpointing) you can run the trainer on 12GB GPUs and potentially 11GB.
+
+### useadam8bit 

    --useadam8bit

 Uses [Tim Dettmer's reduced precision AdamW 8 Bit optimizer](https://github.com/TimDettmers/bitsandbytes).  This seems to have no noticeable impact on quality but is considerable faster and more VRAM efficient. See more below in AdamW vs AdamW 8bit.

-## Epochs
+## __Epochs__

 EveryDream 2.0 has done away with repeats and instead you should set your max_epochs.  Changing epochs has the same effect as changing repeats in DreamBooth or EveryDream1.  For example, if you had 50 repeats and 5 epochs, you would now set max_epochs to 250 (50x5=250).  This is a bit more intuitive as there is no more double meaning for epochs and repeats.

@ -39,7 +44,7 @@ With more training data for your subjects and concepts, you can slowly scale thi

 With less training data, this value should be higher, because more repetition on the images is needed to learn.

-## Resolution
+## __Resolution__

 The resolution for training.  All buckets for multiaspect will be based on the total pixel count of your resolution squared. 

@ -49,7 +54,7 @@ Current supported resolutions can be printed by running the trainer without any

    python train.py

-## Save interval for checkpoints
+## __Save interval for checkpoints__

 While EveryDream 1.0 saved a checkpoint every epoch, this is no longer the case as it would produce too many files as "repeats" are removed in favor of just using epochs instead.  To balance the fact EveryDream users are sometimes training small datasets and sometimes huge datasets, you can now set the interval at which checkpoints are saved.  The default is 30 minutes, but you can change it to whatever you want. 

@ -71,17 +76,17 @@ If you are training a huge dataset (20k+) then saving every 1 epoch may not be v

 Diffusers copies of checkpoints are saved in your /logs/[project_name]/ckpts folder, and can be used to continue training if you want to pick up where you left off.  CKPT files are saved in the root training folder by default.  These folders can be changed. See [Advanced Tweaking](ATWEAKING.md) for more info.

-## Resuming training from previous runs
+## __Resuming training from previous runs__

 If you want to resume training from a previous run, you can do so by pointing to the diffusers copy in the logs folder from which you want to resume.  This is the same --resume_ckpt argument you would use to start training, just pointing to a different location.

    --resume_ckpt "logs\city_gradckptng2_20221231-234604\ckpts\last-city_gradckptng2-ep59-gs00600" ^

-## Learning Rate
+## __Learning Rate__

 The learning rate affects how much "training" is done on the model per training step.  It is a very careful balance to select a value that will learn your data.  See [Advanced Tweaking](ATWEAKING.md) for more info.  Once you have started, the learning rate is a good first knob to turn as you move into more advanced tweaking.

-## Batch Size
+## __Batch Size__

 Batch size is also another "hyperparamter" of itself and there are tradeoffs. It may not always be best to use the highest batch size possible.  Once of the primary reasons to change it is if you get "CUDA out of memory" errors where lowering the value may help.

@ -89,7 +94,7 @@ Batch size is also another "hyperparamter" of itself and there are tradeoffs. It

 While very small batch sizes can impact performance negatively, at some point larger sizes have little impact on overall speed as well, so shooting for the moon is not always advisable.  Changing batch size may also impact what learning rate you use, with typically larger batch_size requiring a slightly higher learning rate.  More info is provided in the [Advanced Tweaking](ATWEAKING.md) document.

-## LR Scheduler
+## __LR Scheduler__

 A learning rate scheduler can change your learning rate as training progresses.

@ -97,7 +102,7 @@ At this time, ED2.0 supports constant or cosine scheduler.

 The constant scheduler is the default and keeps your LR set to the value you set in the command line.  That's really it for constant!  I recommend sticking with it until you are comfortable with general training.  More info in the [Advanced Tweaking](ATWEAKING.md) document.

-## Sampling
+## __Sampling__

 You can set your own sample prompts by adding them, one line at a time, to sample_prompts.txt.  Or you can point to another file with --sample_prompts.

@ -110,3 +115,17 @@ Sample steps declares how often samples are generated and put into the logs and
    --sample_steps 300 ^

 Keep in mind if you drastically change your batch_size, the frequency (in time between samples) of samples will change.  Going from batch size 2 to batch size 10 may reduce how fast steps process, so you may want to reduce sample_steps to compensate.
+
+## __Gradient checkpointing__
+
+This is mostly useful to reduce VRAM for smaller GPUs, and together with AdamW 8 bit and AMP mode can enable <12GB GPU training.
+
+Gradient checkpointing can also offer a higher batch size and/or higher resolution within whatever VRAM you have, so it may be useful even on a 24GB+ GPU if you specifically want to run a very large batch size.  The other option is using gradient accumulation instead.
+
+    --gradient_checkpointing ^
+
+While gradient checkpointing reduces performance, the ability to run a higher batch size brings performance back fairly close to without it. 
+
+You may NOT want to use a batch size as large as 13-14+ on your 24GB+ GPU even if possible, or you may find you need to tweak learning rate all over again to find the right balance.  Generally I would not turn it on for a 24GB GPU training at <640 resolution.
+
+This probably IS a good idea for training at higher resolutions and allows >768 training on 24GB GPUs.  Balancing this toggle, resolution, and batch_size will take a few quick experiments to see what you can run safely.