Onward from DreamBooth
While DreamBooth for Stable Diffusion has led to a lot of success, its limits on training multiple concepts called for new techniques. Kane Wallmann created a fork of Xavier Xiao's SD-Dreambooth code that enabled applying individual captions to training and regularization images, and from there much more power has been unlocked.
Upon release, I immediately trained several characters simultaneously from the recent video game Final Fantasy 7 Remake, with all four main characters captioned simply with their full names. From the first attempt it seemed to work just as well as training a single subject, but distortion of the original model was evident. These early attempts still used regularization images paired with the concepts of "man" and "woman" for male and female characters, staying within the general confines of the original DreamBooth paper.
Onward to fully captioning images
The next logical step was to add an individual caption to every training image to fully describe the scene. CLIP interrogation offers img2txt, though it does not understand new concepts such as the characters, and it often makes mistakes. Nevertheless, a combination of CLIP interrogation, a script that replaces "a man" or "a woman" in captions with character names, and a bit of manual labor fixing the duplications and errors that come out of CLIP interrogation can clean these up. Thus was born the FF7R "V2" model.
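The caption cleanup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the script actually used for the FF7R models: the function names substitute_character and dedupe_phrases are hypothetical, and it assumes each caption is a plain string of comma-separated phrases, as CLIP interrogation output often is.

```python
import re

# Matches the generic subjects that an interrogator tends to emit.
# The word boundaries keep "a mannequin" or "pasta man" from matching.
GENERIC_SUBJECT = re.compile(r"\ba (?:man|woman)\b", flags=re.IGNORECASE)


def substitute_character(caption: str, name: str) -> str:
    """Replace the first generic 'a man' / 'a woman' with a character name."""
    return GENERIC_SUBJECT.sub(name, caption, count=1)


def dedupe_phrases(caption: str) -> str:
    """Drop repeated comma-separated phrases, keeping the first occurrence."""
    seen = set()
    kept = []
    for phrase in (part.strip() for part in caption.split(",")):
        key = phrase.lower()
        if phrase and key not in seen:
            seen.add(key)
            kept.append(phrase)
    return ", ".join(kept)


if __name__ == "__main__":
    raw = "a woman standing in a city, holding a sword, holding a sword"
    print(dedupe_phrases(substitute_character(raw, "Tifa Lockhart")))
```

A script like this only handles the mechanical part. Captions where the interrogator misread the scene still need to be fixed by hand.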
How far can this go?
Can style also be included along with multiple characters? Well, it turns out yes. Pictures of the world itself can be mixed in as well. Adding different districts, buildings, and landscapes from a game world enables style transfer in a single training. Because of course it can. Want to draw "Gotham City in the style of Midgar City" (from Final Fantasy)? Just ask for it. Thus "V3" was born. And from there, adding more data, more scenes, and more characters continued through to V4.
Forward the Foundation EveryDream
The pending issue is preserving the original capabilities of the model while continuing to grow the dataset. DreamBooth attempts this via "regularization," which is training on a grab-bag of images created by the model itself. The next logical step is to take the last step away from DreamBooth: stop using regularization images for model preservation and instead mix ground truth data back in. For the sake of clarity, I'm using a new name, EveryDream, for extending model training with new concepts and a "reasonable" mix of ground truth, without attempting to train on a LAION dataset consisting of millions of images or more, which is impractical for hobbyists.
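As a rough sketch of what mixing ground truth back in could look like when building a training list, the Python below samples captioned ground truth images so that they make up a chosen fraction of the final set. The function name mix_with_ground_truth, the (path, caption) tuple layout, and the default fraction of 0.5 are all assumptions made for illustration. What counts as a "reasonable" mix is left open here and would need to be found by experiment.

```python
import random


def mix_with_ground_truth(new_items, ground_truth_items,
                          ground_truth_fraction=0.5, seed=0):
    """Return a shuffled training list of new-concept and ground truth items.

    All new_items are kept. Ground truth items are sampled without
    replacement so they make up roughly ground_truth_fraction of the result,
    capped by how many ground truth items are available.
    """
    if not 0 <= ground_truth_fraction < 1:
        raise ValueError("ground_truth_fraction must be in [0, 1)")

    rng = random.Random(seed)  # seeded so the mix is reproducible
    wanted = round(len(new_items) * ground_truth_fraction
                   / (1 - ground_truth_fraction))
    wanted = min(wanted, len(ground_truth_items))

    mixed = list(new_items) + rng.sample(list(ground_truth_items), wanted)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Hypothetical file names and captions, for illustration only.
    new = [(f"ff7r_{i}.jpg", "Cloud Strife in Midgar") for i in range(30)]
    truth = [(f"gt_{i}.jpg", "a photo of a street") for i in range(100)]
    train = mix_with_ground_truth(new, truth, ground_truth_fraction=0.25)
    print(len(train), "training items")
```

Unlike regularization images, the ground truth items here are real captioned images rather than outputs generated by the model being trained.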