8.0 KiB

Raw Blame History

Waifu Diffusion 1.3 Release Notes

HuggingFace Page for model download: https://huggingface.co/hakurei/waifu-diffusion-v1-3

Model Overview
Training Process
Prompting
License
Sample Generations
Team Members and Acknowledgements

Model Overview

The Waifu Diffusion 1.3 model is a Stable Diffusion model that has been finetuned from Stable Diffusion v1.4. I would like to personally thank everyone that had been involved with the development and release of Stable Diffusion, as all of this work for Waifu Diffusion would not have been possible without their original codebase and pre-existing model weights from which Waifu Diffusion was finetuned from.

The data used for finetuning Waifu Diffusion 1.3 was 680k text-image samples that had been downloaded through a booru site that provides high-quality tagging and original sources to the artworks themselves that are uploaded to the site. I also want to personally thank them as well, as without their hardwork the generative quality from this model would not have been feasible without going to financially extreme lengths to acquiring the data to use for training. The Booru in question would also like to remain anonymous due to the current climate regarding AI generated imagery.

Within the HuggingFace Waifu Diffusion 1.3 Repository are 4 final models:

Float 16 EMA Pruned: This is the smallest available form for the model at 2GB. This model is to be used for inference purposes only.
Float 32 EMA Pruned: The float32 weights are the second smallest available form of the model at 4GB. This is to be used for inference purposes only.
Float 32 Full Weights: The full weights contain the EMA weights which are not used during inference. These can be used for either training or inference.
Float 32 Full Weights + Optimizer Weights: The optimizer weights contain all of the optimizer states used during training. It is 14GB large and there is no quality difference between this model and the others as this model is to be used for training purposes only.

Various modifications to the data had been made since the Waifu Diffusion 1.2 model which included:

Removing underscores.
Removing parenthesis.
Separating each booru tag with a comma.
Randomizing tag order.

Training Process

The finetuning had been conducted with a fork from the original Stable Diffusion codebase, which is known as Waifu Diffusion. The differences between the main repo and the fork is that the fork includes fixes to the original training code as well as a custom dataloader to train on text-image pairs that are stored locally.

For finetuning, the base model Stable Diffusion 1.4 had been finetuned on 680k text-image samples for 10 epochs at a flat learning rate of 5e-6.

The hardware used was a GPU VM instance which had the following specs:

8x 48GB A40 GPUs
24 AMD Epyc Milan vCPU cores
192GB of RAM
250GB Storage

Training had taken approximately 10 days to finish and roughly around $3.1k had been spent on compute costs.

Prompting

As a result of the removal of underscores and including random tag order, it is now much more easier to prompt with the Waifu Diffusion 1.3 model compared to the predecessor 1.2 version.

So now, how do you generate something? Let's say you have a regular prompt like this:

a girl wearing a hoodie in the rain

To generate an image of a girl wearing a hoodie in the rain using Waifu Diffusion 1.3, you would prompt it out with booru tags that would look like:

original, 1girl, solo, portrait, hoodie, wearing hoodie

Looks boring, right? Thankfully, with Booru tags, all you have to do to make the output look more detailed is to simply add more tags. Let's say you want to add rain and backlighting, then you would have a prompt that looks like:

original, 1girl, solo, portrait, hoodie, wearing hoodie, red hoodie, long sleeves, simple background, backlighting, rain, night, depth of field

There, that's more like it.

Overall, to get better outputs you have to:

Add more tags to your prompt, especially compositional tags. A full list can be found here.
Include a copyright tag that is associated with high quality art like genshin impact or arknights.
Be specific. Use specific tags. The model cannot assume what you want, so you have to be specific.
Do not use underscores. They're deprecated since Waifu Diffusion 1.2.

License

The Waifu Diffusion 1.3 Weights have been released under the CreativeML Open RAIL-M License.

Sample Generations

_{1girl, black eyes, black hair, black sweater, blue background, bob cut, closed mouth, glasses, medium hair, red-framed eyewear, simple background, solo, sweater, upper body, wide-eyed}

_{1girl, aqua eyes, baseball cap, blonde hair, closed mouth, earrings, green background, hat, hoop earrings, jewelry, looking at viewer, shirt, short hair, simple background, solo, upper body, yellow shirt}

_{1girl, black bra, black hair, black panties, blush, borrowed character, bra, breasts, cleavage, closed mouth, gradient hair, hair bun, heart, large breasts, lips, looking at viewer, multicolored hair, navel, panties, pointy ears, red hair, short hair, sweat, underwear}

_{yakumo ran, arknights, 1girl, :d, animal ears, blonde hair, breasts, cowboy shot, extra ears, fox ears, fox shadow puppet, fox tail, head tilt, large breasts, looking at viewer, multiple tails, no headwear, short hair, simple background, smile, solo, tabard, tail, white background, yellow eyes}

_{chen, arknights, 1girl, animal ears, brown hair, cat ears, cat tail, closed mouth, earrings, face, hat, jewelry, lips, multiple tails, nekomata, painterly, red eyes, short hair, simple background, solo, tail, white background}

Team Members and Acknowledgements

This project would not have been possible without the incredible work by the CompVis Researchers. I would also like to personally thank everyone for their generous support in our Discord server! Thank you guys.

In order to reach us, you can join our Discord server.

8.0 KiB Raw Blame History