adding some links to experimental results that are posted on discord

This commit is contained in:
Victor Hall 2024-06-09 02:10:35 -04:00
parent 3b06e9f651
commit da60499728
2 changed files with 16 additions and 2 deletions

View File

@ -86,6 +86,8 @@ You can change the type of loss from the standard [MSE ("L2") loss](https://pyto
`mse_huber` will use MSE at timestep 0 and Huber at timestep 999, interpolating between the two across the intermediate timesteps. `huber_mse` is the reverse.
[Experiment results](https://discord.com/channels/1026983422431862825/1229478214020235355) (Discord)
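As a rough illustration of how such a timestep-weighted blend could be computed, here is a minimal sketch (not the trainer's actual code; the function and argument names are hypothetical, and 4D latents are assumed):

```python
import torch
import torch.nn.functional as F

def mse_huber_loss(pred, target, timesteps, max_timestep=999, delta=1.0):
    # Per-sample MSE and Huber losses (assumes B,C,H,W latents).
    mse = F.mse_loss(pred, target, reduction="none").mean(dim=(1, 2, 3))
    huber = F.huber_loss(pred, target, reduction="none", delta=delta).mean(dim=(1, 2, 3))
    # Weight 0 at timestep 0 (pure MSE) rising to 1 at timestep 999 (pure Huber).
    w = timesteps.float() / max_timestep
    return ((1.0 - w) * mse + w * huber).mean()  # huber_mse would swap the two weights
```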
## LR tweaking
You should use the [Optimizer config](doc/OPTIMZER.md) to tweak the learning rate instead of the primary arg here; it is kept for legacy support so the Jupyter Notebook remains easy to use in a happy-path or simplified scenario.
@ -308,7 +310,7 @@ Using the GPU for ema incurs only a small speed penalty of around 5-10% with all
Generally, I recommend picking a device and an appropriate interval for that device first and sticking with those values, then tweaking `ema_decay_rate` up or down according to how you want the EMA model to look vs. your non-EMA model. From there, if your EMA model seems to "lag behind" the non-EMA model by "too much" (subjectively judged), you can decrease the decay rate. If it is identical or nearly identical, use a slightly higher value.
## ema_strength_target
### ema_strength_target
This arg is a non-standard way of calculating the actual decay rate used. It attempts to calculate a decay rate based on your `ema_update_interval` and the total length of training, compensating for both. Values of 0.01-0.15 should work, with higher values leading to an EMA model that deviates more from the non-EMA model, similar to how the decay rate works. It is intended as more of a "strength" value for EMA, i.e. "how much" of the EMA model (as a factor, e.g. 0.10 = 10% "strength") is kept over the totality of training.
@ -318,6 +320,8 @@ While the calculation makes sense in how it compensates for inteval and total tr
If you use `ema_strength_target`, the actual calculated `ema_decay_rate` will be printed in your logs; pay attention to this value and use it to inform future decisions on EMA tuning.
[Experimental results](https://discord.com/channels/1026983422431862825/1150790432897388556) for EMA on Discord.
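For intuition, here is a minimal sketch of a standard EMA weight update, plus one plausible way a "strength" target could be converted into a decay rate. This is an illustration under those assumptions, not the trainer's actual formula; rely on the decay rate printed in your logs.

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay):
    # Standard EMA step: ema = decay * ema + (1 - decay) * current weights.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p.to(ema_p.device), 1.0 - decay)

def decay_from_strength(strength, total_steps, update_interval):
    # One plausible reading of "strength": after all EMA updates, roughly
    # `strength` of the original EMA weights has been replaced, i.e.
    # decay ** num_updates ≈ 1 - strength.
    num_updates = max(total_steps // update_interval, 1)
    return (1.0 - strength) ** (1.0 / num_updates)
```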
## AdaCoor optimizer
This optimizer was made by stripping out non-functional components of CoordinateDoWG and applying several tweaks for high memory efficiency. It is a learning-rate-free adaptive optimizer whose only recommended parameter is an epsilon value of 1e-8. It does not scale well with high batch sizes, so batch sizes no greater than 8 are recommended unless slow and careful training is desired.
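If you want to try it, the optimizer is presumably selected through the optimizer config rather than a CLI arg. Below is a hypothetical sketch of such an entry, written as a Python dict dumped to JSON; the key names are assumptions and may not match the repo's actual optimizer config schema, and only the epsilon value of 1e-8 comes from the text above.

```python
import json

# Hypothetical optimizer config entry for AdaCoor; key names are illustrative.
adacoor = {
    "optimizer": "adacoor",
    "epsilon": 1e-8,  # the only recommended parameter, per the section above
}
print(json.dumps(adacoor, indent=2))
```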
@ -328,6 +332,10 @@ This is an implementation of pyramid noise as first introduced here https://wand
Pyramid noise can be used to improve dynamic range in short finetunes of < 2000 steps at discounts greater than 0.40. At all discount levels, pyramid noise appears to improve the amount of detail generated in images. However, it is not advised to use pyramid noise for a full training run, as the noise affects the whole model rapidly and can destabilize it if trained for too many steps. At a discount of 0, pyramid noise is disabled.
[Experimental results](https://discord.com/channels/1026983422431862825/1176398312870514788) (Discord)
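For reference, a minimal sketch of the multi-resolution ("pyramid") noise idea from the report linked above, assuming 4D latent tensors; this is an illustration, not the trainer's exact implementation. Note that with `discount=0` every lower-resolution term drops out.

```python
import torch
import torch.nn.functional as F

def pyramid_noise(shape, discount=0.4, levels=6, generator=None):
    # Start from plain Gaussian noise, then add progressively lower-resolution
    # noise maps upsampled back to full size, each scaled by discount**level,
    # and renormalize to unit variance.
    b, c, h, w = shape
    noise = torch.randn(shape, generator=generator)
    for level in range(1, levels):
        scale = 2 ** level
        if h < scale or w < scale:
            break
        low = torch.randn(b, c, h // scale, w // scale, generator=generator)
        noise = noise + (discount ** level) * F.interpolate(
            low, size=(h, w), mode="bilinear", align_corners=False)
    return noise / noise.std()

latent_noise = pyramid_noise((4, 4, 64, 64), discount=0.4)
```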
## Attention Type
The `attn_type` arg allows you to select `xformers`, `sdp`, or `slice`. Xformers uses the [xformers package](https://github.com/facebookresearch/xformers). SDP uses the scaled dot product mechanism [built into Pytorch](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) as of recent Pytorch updates. Slice uses head splitting. `sdp` is the default and suggested value, as it seems to save a small amount of VRAM while also being approximately 5% faster than xformers. There is likely little reason to use slice or xformers, but they are kept for the time being for experimentation or consistency with prior experiments.
[Experimental results](https://discord.com/channels/1026983422431862825/1178007113151287306) (Discord link)
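As a point of reference, the `sdp` path ultimately relies on PyTorch's fused attention op. A minimal standalone illustration (shapes are arbitrary), showing that it matches the explicit softmax formulation:

```python
import math
import torch
import torch.nn.functional as F

# (batch, heads, tokens, head_dim) — arbitrary illustrative shapes.
q = torch.randn(1, 8, 77, 64)
k = torch.randn(1, 8, 77, 64)
v = torch.randn(1, 8, 77, 64)

# PyTorch picks a fused kernel when one is available for the inputs/device.
out = F.scaled_dot_product_attention(q, k, v)
# Reference: softmax(q k^T / sqrt(d)) v
ref = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-4))
```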

View File

@ -33,6 +33,8 @@ Standard full precision AdamW optimizer exposed by PyTorch. Not recommended. S
Tim Dettmers / bitsandbytes AdamW and Lion 8-bit optimizers. adamw8bit is the default and recommended setting as it is well understood, and lion8bit is very VRAM efficient. Widely documented on the web.
AdamW vs AdamW8bit: [Experimental results](https://discord.com/channels/1026983422431862825/1120697188427771924) on Discord.
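For illustration, the two optimizers are instantiated almost identically; only the import differs, and the hyperparameter values below are placeholders rather than recommendations.

```python
import torch
import bitsandbytes as bnb  # provides the 8-bit optimizers

model = torch.nn.Linear(8, 8)  # stand-in for the UNet / text encoder

# Full-precision AdamW (PyTorch) vs. 8-bit AdamW (bitsandbytes). The 8-bit
# variant quantizes optimizer state, which is where most of the VRAM savings
# come from; the update rule and hyperparameters are otherwise the same.
opt_fp32 = torch.optim.AdamW(model.parameters(), lr=1e-6, betas=(0.9, 0.999),
                             weight_decay=0.010, eps=1e-8)
opt_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-6, betas=(0.9, 0.999),
                               weight_decay=0.010, eps=1e-8)
```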
* lion
Lucidrains' [implementation](https://github.com/lucidrains/lion-pytorch) of the [lion optimizer](https://arxiv.org/abs/2302.06675). Click links to read more. `Epsilon` is not used by lion. You should prefer lion8bit over this optimizer as it is more memory efficient.
@ -50,6 +52,8 @@ The recommendations are based on "1/10th LR" but "10x the weight decay" compared
There are no known recommendations for the CLIP text encoder. Using an even larger weight decay, increased epsilon, or even lower LR may be effective for the text encoder. Further investigation into betas for the text encoder is needed as well.
Some investigation on Lion tuning is [here](https://discord.com/channels/1026983422431862825/1098682949978820710) on Discord.
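As a worked example of the rule of thumb above (the AdamW starting values are hypothetical placeholders, not recommendations):

```python
# Hypothetical AdamW baseline; substitute your own values.
adamw_lr, adamw_weight_decay = 1e-6, 0.010

lion_lr = adamw_lr / 10                      # "1/10th LR"            -> 1e-7
lion_weight_decay = adamw_weight_decay * 10  # "10x the weight decay" -> 0.10
print(lion_lr, lion_weight_decay)
```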
#### D-Adaptation optimizers
[D-Adaptation](https://arxiv.org/abs/2301.07733) [versions](https://github.com/facebookresearch/dadaptation) of various optimizers.
@ -90,6 +94,8 @@ This will freeze the text encoder up to the last 2 layers, leaving the earlier l
Recommended settings for SD2.1 are provided in `optimizerSD21.json`. Unfreezing more layers will speed up training at the expense of text encoder stability. You can also try unfreezing the embeddings by setting `"freeze_embeddings": false`. This may improve training, but it also seems to lead to quicker frying.
There are some [experimental results here](https://discord.com/channels/1026983422431862825/1106511648891609120) (Discord link) on layer freezing.
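For orientation, here is a sketch of what a text-encoder freezing block could look like, written as a Python dict. Only `"freeze_embeddings"` is taken from the text above; the other key name is a hypothetical stand-in, and the authoritative example is `optimizerSD21.json` in the repo.

```python
import json

# Illustrative only: "freeze_embeddings" appears in the docs above; the other
# key is a hypothetical stand-in for "freeze everything except the last 2 layers".
text_encoder_freezing = {
    "freeze_embeddings": False,
    "unfreeze_last_n_layers": 2,
}
print(json.dumps(text_encoder_freezing, indent=2))
```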
## General Beta, weight decay, epsilon, etc. tuning
Betas, weight decay, and epsilon are documented in the [AdamW paper](https://arxiv.org/abs/1711.05101) and there is a wealth of information on the web, but consider tweaking them experimental.