What makes EveryDream different from Dreambooth?

EveryDream is a general case fine tuner. It does not explicitly implement the techniques from the Dreambooth paper.

That means there is no "class" or "token" or "regularization images". It simply trains image and caption pairs, much more like the original training of Stable Diffusion, just at a much smaller "at home" scale.

For those experienced in machine learning, forgive me for stretching some terms. This is written for the typical user coming from Dreambooth training, using the vocabulary as it is commonly used there.

What is "regularization" and "preservation"?

The Dreambooth technique uses the concept of adding generated images from the model itself to try to keep training from "veering too off course" and "damaging" the model while fine tuning a specific subject with just a handful of images. It served the purpose of "preserving" the integrity of the model. Early on in Dreambooth's lifecycle, people would train 5-20 images of their face, and use a few hundred or maybe a thousand "regularization" images along with the 5-20 training images of their new subject. Since then, many people want to scale to larger training, but more on that later...

It's very important to note these "regularization" images have been generated out of SD itself, using a simple prompt such as "person" or "woman". The prompt was meant as some sort of general "class" name that the subject you were trying to train fell under. For training your own face, "person" or "man" or "woman" has been very popular as "class". Likewise, training other things like your own dog, you might generate "dog" class/prompt regularization images for Dreambooth.

The purpose of the "regularization" images is to "preserve" the model so that not everything generated out of the model looks like your training subject, "bob smith" or whatever it may be. So that, hopefully, Tom Cruise does not suddenly look like your Bob Smith training images.

Often these regularization images are quite ugly, and they get trained back into the model by Dreambooth, reinforcing bad habits. This works short term for training one new face, but does not scale and eventually has the same problems of "damaging" the model and making it uglier overall.

I instead propose you replace images generated out of SD itself with original "ground truth" data.

Enter ground truth

"Ground truth" for the sake of this document means real images not generated by AI. It's very easy to get publicly available ML data sets to serve this purpose and replace generated "regularization" images with real photos or otherwise.

Sources include FFHQ, COCO, and LAION. There is a simple scraper to search LAION parquet files in the tools repo, and the LAION dataset was used by CompVis and Stability.ai themselves to train the base model.

Using ground truth allows fine tuning to scale to potentially as many images as you want, including training dozens of characters at once using thousands of images, which many EveryDream community members have done.

Using ground truth images for the general purpose of "preservation" avoids reinforcing the bad habits that generated regularization images carry. It may even help the model look better while training your actual training subjects, depending on how carefully you select your ground truth images.

Preservation more generally in EveryDream

"Preservation" images and "training" images have no special distinction in EveryDream. All images are treated the same and the trainer does not know the difference. It is all in how you use them.

Any preservation images still need a caption of some sort. Just "person" may be sufficient, since for this particular example we're only trying to simulate Dreambooth. This can be as easy as selecting all the images, pressing F2 to rename, typing person_ (with the underscore), and pressing Enter. Windows will append (x) to every file to make sure the filenames are unique, and EveryDream interprets the underscore as the end of the caption when present in the filename, so all the images will be read as having a caption of simply person, similar to how many people train Dreambooth.
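The filename rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior as described here, not EveryDream's actual code. The helper name caption_from_filename is hypothetical, and the handling of names without an underscore (use the whole name minus the extension) is an assumption.

```python
from pathlib import Path


def caption_from_filename(filename: str) -> str:
    """Return the caption implied by a filename (hypothetical helper).

    Assumption: everything before the first underscore is the caption;
    with no underscore, the whole name minus the extension is used.
    """
    stem = Path(filename).stem  # drop the extension, e.g. ".jpg"
    return stem.split("_", 1)[0].strip()


# Names like the ones Windows produces after a bulk F2 rename to "person_"
for name in ["person_ (1).jpg", "person_ (2).jpg", "bob smith_01.png"]:
    print(f"{name!r} -> {caption_from_filename(name)!r}")
```

Under these assumptions, every file renamed this way ends up with the same caption, person, no matter what number Windows appended.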

You could also generate "person" regularization images out of any Stable Diffusion inference application or download one of the premade regularization sets, but I find this is less than ideal. For small training, regularization or preservation is simply not needed. For longer term training you're much better off mixing real "ground truth" images (images not generated by an AI) into your data instead of generated data. Training back on generated data will reinforce the errors in the model, like extra limbs, weird fingers, watermarks, etc. Using real ground truth data can actually help improve the model.

Therefore, for model preservation I suggest using other web scrape data, or ML data sets like FFHQ, ImageNet, LAION, or COCO, in place of the Stable Diffusion generated images that typical Dreambooth trainers use.

It's important to understand that the Dreambooth trainers out there are actually training on those "regularization" images as if they are training images. So, any ugly images in regularization sets are reinforced into the model. This is why I moved away from Dreambooth back in late October 2022 with EveryDream 1.0, which was rewritten in late 2022 and early 2023 into this repo, EveryDream 2.

Simulating Dreambooth

You can roughly simulate what most Dreambooth trainers do by following the guidelines in the Data Balancing doc to mix in other (non-training) images for model preservation. For example, if you are training 50 images of your face, you could grab 1000 images from the FFHQ face close up set and organize your data like this:

/my_data_root/training/  (your 50 training images)
/my_data_root/preservation/  (1000 images from FFHQ)

Also place a file called multiply.txt in the preservation folder with a value of, for example, 0.025 typed in it. Train from "data_root": "/my_data_root" and it will mix in 25 random images (1000 x 0.025) from the preservation folder every epoch.
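The arithmetic behind the multiplier is easy to check. The sketch below is an assumption about how a fractional multiplier could be applied (a fresh random subset each epoch, rounded to a whole number of images), not a copy of EveryDream's loader. The names images_per_epoch and pick_for_epoch are hypothetical, and the file names are stand-ins.

```python
import random


def images_per_epoch(folder_size: int, multiplier: float) -> int:
    """Number of images drawn from a folder each epoch (rounded)."""
    return round(folder_size * multiplier)


def pick_for_epoch(images: list[str], multiplier: float, seed: int) -> list[str]:
    """Draw a random subset without replacement; assumes multiplier <= 1."""
    rng = random.Random(seed)
    return rng.sample(images, images_per_epoch(len(images), multiplier))


preservation = [f"{i:05d}.png" for i in range(1000)]  # stand-in file names
print(images_per_epoch(len(preservation), 0.025))  # 1000 x 0.025

# A different seed per epoch gives a different subset of the same size
epoch_0 = pick_for_epoch(preservation, 0.025, seed=0)
epoch_1 = pick_for_epoch(preservation, 0.025, seed=1)
print(len(epoch_0), len(epoch_1))
```

With 50 training images and 25 preservation images per epoch, preservation data makes up about a third of each epoch. Adjusting the value in multiply.txt changes that ratio without touching the image folders.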

If you use Stable Diffusion generated images, this would be similar to Dreambooth, though one difference remains: the "regularization" images are not specifically paired into batches with training images. They're mixed in randomly.
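To make that distinction concrete, here is a rough sketch contrasting the two approaches. Both functions are illustrative assumptions, and neither is taken from EveryDream or from any particular Dreambooth trainer. The hypothetical paired_batches puts one training image next to one regularization image in every batch, while the hypothetical mixed_batches pools everything, shuffles, and slices.

```python
import random


def paired_batches(training, regularization):
    """Paired idea: each batch holds one training and one regularization image."""
    return [[t, r] for t, r in zip(training, regularization)]


def mixed_batches(training, preservation, batch_size, seed=0):
    """Random mixing: pool everything, shuffle, then slice into batches."""
    pool = list(training) + list(preservation)
    random.Random(seed).shuffle(pool)
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]


training = [f"train {i}" for i in range(50)]      # stand-in names
preservation = [f"pres {i}" for i in range(25)]   # stand-in names

batches = mixed_batches(training, preservation, batch_size=4)
print(len(batches))  # 75 images in batches of 4 -> 19 batches, the last one partial
```

In the mixed version, any given batch may contain only training images, only preservation images, or some of each. The proportion is only controlled on average across the epoch.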
