diff --git a/EveryDream-training.md b/EveryDream-training.md
new file mode 100644
index 0000000..7b4b168
--- /dev/null
+++ b/EveryDream-training.md
@@ -0,0 +1,17 @@
+### Onward from DreamBooth
+
+While DreamBooth for Stable Diffusion has led to a lot of success, its limits on training multiple concepts called for new techniques. Kane Wallmann created a fork of Xavier Xiao's SD-Dreambooth code that enabled applying individual captions to training and regularization images, and from there much more power has been unlocked.
+
+Upon its release I immediately trained several characters from the recent video game Final Fantasy 7 Remake simultaneously, with all four main characters represented and captioned simply with their full names. From the first attempt it seemed to work just as well as training a single subject, but distortion of the original model was evident. These early attempts still used regularization images, paired with the concepts "man" and "woman" for male and female characters.
+
+## Onward to fully captioning images
+
+The next logical step was to add an individual caption to every training image to fully describe the scene. CLIP interrogation offers img2txt, though it does not understand new concepts such as the characters, and it often makes mistakes. Nevertheless, a combination of CLIP interrogation, a script that replaces "a man" or "a woman" in the captions with the character names, and a bit of manual labor to fix duplications and errors can clean these up. Thus was born the FF7R V3 model.
+
+## How far can this go?
+
+Can style also be included along with multiple characters? Well, it turns out yes. Pictures of the world itself can be mixed in as well. Adding different districts, buildings, and landscapes from a game world enables style transfer in a single training run. Because of course it can. Want to draw "Gotham City in the style of Midgar City" (from Final Fantasy)? Just ask for it.
+
+## Forward the ~~Foundation~~ EveryDream
+
+The pending issue is preserving the original capabilities of the model. DreamBooth attempts this via regularization, which means training on a grab-bag of images created by the model itself. While it can work, ground truth images should be better.
\ No newline at end of file