Merge branch 'harubaru:main' into patch-2

This commit is contained in:
Carlos Chavez 2022-09-10 19:40:10 -05:00 committed by GitHub
commit a532563a71
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
8 changed files with 198 additions and 192 deletions

189
README.md

@ -9,7 +9,12 @@ Waifu Diffusion is the name for this project of finetuning Stable Diffusion on D
<sub>Prompt: touhou 1girl komeiji_koishi portrait</sub>
## Documentation
[Training Guide](https://github.com/harubaru/waifu-diffusion/blob/main/docs/en/training/README.md)
[Index](./docs/en/README.md)
[Weights](./docs/en/weights/README.md)
[Training Guide](./docs/en/training/README.md)
All thanks go to CompVis and Stability AI for releasing this codebase!
@ -22,188 +27,6 @@ Model Link: https://huggingface.co/hakurei/waifu-diffusion
# Stable Diffusion
*Stable Diffusion was made possible thanks to a collaboration with [Stability AI](https://stability.ai/) and [Runway](https://runwayml.com/) and builds upon our previous work:*
[**High-Resolution Image Synthesis with Latent Diffusion Models**](https://ommer-lab.com/research/latent-diffusion-models/)<br/>
[Robin Rombach](https://github.com/rromb)\*,
[Andreas Blattmann](https://github.com/ablattmann)\*,
[Dominik Lorenz](https://github.com/qp-qp),
[Patrick Esser](https://github.com/pesser),
[Björn Ommer](https://hci.iwr.uni-heidelberg.de/Staff/bommer)<br/>
**CVPR '22 Oral**
which is available on [GitHub](https://github.com/CompVis/latent-diffusion). PDF at [arXiv](https://arxiv.org/abs/2112.10752). Please also visit our [Project page](https://ommer-lab.com/research/latent-diffusion-models/).
![txt2img-stable2](assets/stable-samples/txt2img/merged-0006.png)
[Stable Diffusion](#stable-diffusion-v1) is a latent text-to-image diffusion
model.
Thanks to a generous compute donation from [Stability AI](https://stability.ai/) and support from [LAION](https://laion.ai/), we were able to train a Latent Diffusion Model on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) database.
Similar to Google's [Imagen](https://arxiv.org/abs/2205.11487),
this model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts.
With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB VRAM.
See [this section](#stable-diffusion-v1) below and the [model card](https://huggingface.co/CompVis/stable-diffusion).
## Requirements
A suitable [conda](https://conda.io/) environment named `ldm` can be created
and activated with:
```
conda env create -f environment.yaml
conda activate ldm
```
You can also update an existing [latent diffusion](https://github.com/CompVis/latent-diffusion) environment by running
```
conda install pytorch torchvision -c pytorch
pip install transformers==4.19.2
pip install -e .
```
## Stable Diffusion v1
Stable Diffusion v1 refers to a specific configuration of the model
architecture that uses a downsampling-factor 8 autoencoder with an 860M UNet
and CLIP ViT-L/14 text encoder for the diffusion model. The model was pretrained on 256x256 images and
then finetuned on 512x512 images.
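As a rough illustration of what a downsampling factor of 8 means, the sketch below computes the shape of the latent that the UNet works on for a given image size. The helper name `latent_shape` and the default of 4 latent channels are assumptions made for this example, not something stated in the text above.

```python
def latent_shape(height, width, channels=4, factor=8):
    """Return the (C, H/f, W/f) shape of the latent for an H x W image.

    channels=4 is an assumed default for illustration; factor=8 is the
    autoencoder downsampling factor described above.
    """
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return (channels, height // factor, width // factor)


print(latent_shape(512, 512))  # (4, 64, 64)
```

Under these assumptions a 512x512 image is denoised as a 64x64 latent, which is a large part of why the model fits on a single consumer GPU.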
*Note: Stable Diffusion v1 is a general text-to-image diffusion model and therefore mirrors biases and (mis-)conceptions that are present
in its training data.
Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding [model card](https://huggingface.co/CompVis/stable-diffusion).
Research into the safe deployment of general text-to-image models is an ongoing effort. To prevent misuse and harm, we currently provide access to the checkpoints only for [academic research purposes upon request](https://stability.ai/academia-access-form).
**This is an experiment in safe and community-driven publication of a capable and general text-to-image model. We are working on a public release with a more permissive license that also incorporates ethical considerations.***
[Request access to Stable Diffusion v1 checkpoints for academic research](https://stability.ai/academia-access-form)
### Weights
We currently provide three checkpoints, `sd-v1-1.ckpt`, `sd-v1-2.ckpt` and `sd-v1-3.ckpt`,
which were trained as follows,
- `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
- `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`.
515k steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
- `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-improved-aesthetics" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
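The "10% dropping of the text-conditioning" used for `sd-v1-3.ckpt` can be pictured as replacing a caption with an empty prompt for a random tenth of the training examples. The following is only a minimal sketch of that idea; the function name `drop_conditioning` and the use of an empty string as the unconditional prompt are assumptions, not the actual training code.

```python
import random


def drop_conditioning(captions, drop_prob=0.1, rng=None):
    """Replace each caption with "" with probability drop_prob.

    Hypothetical helper: the empty string stands in for the
    unconditional prompt the model also learns to denoise from.
    """
    if rng is None:
        rng = random.Random()
    return ["" if rng.random() < drop_prob else caption for caption in captions]


batch = ["a cat", "a dog", "a horse", "a house"]
print(drop_conditioning(batch, drop_prob=0.1, rng=random.Random(0)))
```

Training on a mix of conditioned and unconditioned examples is what allows the two noise predictions to be combined at sampling time.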
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling
steps show the relative improvements of the checkpoints:
![sd evaluation results](assets/v1-variants-scores.jpg)
### Text-to-Image with Stable Diffusion
![txt2img-stable2](assets/stable-samples/txt2img/merged-0005.png)
![txt2img-stable2](assets/stable-samples/txt2img/merged-0007.png)
Stable Diffusion is a latent diffusion model conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder.
#### Sampling Script
After [obtaining the weights](#weights), link them
```
mkdir -p models/ldm/stable-diffusion-v1/
ln -s <path/to/model.ckpt> models/ldm/stable-diffusion-v1/model.ckpt
```
and sample with
```
python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
```
By default, this uses a guidance scale of `--scale 7.5`, [Katherine Crowson's implementation](https://github.com/CompVis/latent-diffusion/pull/51) of the [PLMS](https://arxiv.org/abs/2202.09778) sampler,
and renders images of size 512x512 (which it was trained on) in 50 steps. All supported arguments are listed below (type `python scripts/txt2img.py --help`).
```commandline
usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA] [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS]
[--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT] [--seed SEED] [--precision {full,autocast}]
optional arguments:
-h, --help show this help message and exit
--prompt [PROMPT] the prompt to render
--outdir [OUTDIR] dir to write results to
--skip_grid do not save a grid, only individual samples. Helpful when evaluating lots of samples
--skip_save do not save individual samples. For speed measurements.
--ddim_steps DDIM_STEPS
number of ddim sampling steps
--plms use plms sampling
--laion400m uses the LAION400M model
--fixed_code if enabled, uses the same starting code across samples
--ddim_eta DDIM_ETA ddim eta (eta=0.0 corresponds to deterministic sampling
--n_iter N_ITER sample this often
--H H image height, in pixel space
--W W image width, in pixel space
--C C latent channels
--f F downsampling factor
--n_samples N_SAMPLES
how many samples to produce for each given prompt. A.k.a. batch size
--n_rows N_ROWS rows in the grid (default: n_samples)
--scale SCALE unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
--from-file FROM_FILE
if specified, load prompts from this file
--config CONFIG path to config which constructs model
--ckpt CKPT path to checkpoint of model
--seed SEED the seed (for reproducible sampling)
--precision {full,autocast}
evaluate at this precision
```
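The formula in the `--scale` help text can be tried out on plain numbers. This is a small sketch that applies it element-wise to two lists standing in for the unconditional and conditional noise predictions; the name `guided_eps` is made up for this example and real predictions would be tensors rather than lists.

```python
def guided_eps(eps_uncond, eps_cond, scale):
    """eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty)), element-wise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_empty = [0.0, 1.0]
eps_text = [1.0, 3.0]
print(guided_eps(eps_empty, eps_text, 1.0))  # [1.0, 3.0]
print(guided_eps(eps_empty, eps_text, 2.0))  # [2.0, 5.0]
```

A scale of 1.0 reproduces the conditional prediction, while larger values push the result further away from the unconditional one.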
Note: The inference config for all v1 versions is designed to be used with EMA-only checkpoints.
For this reason `use_ema=False` is set in the configuration, otherwise the code will try to switch from
non-EMA to EMA weights. If you want to examine the effect of EMA vs no EMA, we provide "full" checkpoints
which contain both types of weights. For these, `use_ema=False` will load and use the non-EMA weights.
#### Diffusers Integration
Another way to download and sample Stable Diffusion is by using the [diffusers library](https://github.com/huggingface/diffusers/tree/main#new--stable-diffusion-is-now-fully-compatible-with-diffusers):
```py
# make sure you're logged in with `huggingface-cli login`
from torch import autocast
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3-diffusers",
    use_auth_token=True
)
prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt)["sample"][0]
image.save("astronaut_rides_horse.png")
```
### Image Modification with Stable Diffusion
By using a diffusion-denoising mechanism as first proposed by [SDEdit](https://arxiv.org/abs/2108.01073), the model can be used for different
tasks such as text-guided image-to-image translation and upscaling. Similar to the txt2img sampling script,
we provide a script to perform image modification with Stable Diffusion.
The following describes an example where a rough sketch made in [Pinta](https://www.pinta-project.com/) is converted into a detailed artwork.
```
python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8
```
Here, strength is a value between 0.0 and 1.0 that controls the amount of noise added to the input image.
Values that approach 1.0 allow for lots of variations but will also produce images that are not semantically consistent with the input. See the following example.
**Input**
![sketch-in](assets/stable-samples/img2img/sketch-mountains-input.jpg)
**Outputs**
![out3](assets/stable-samples/img2img/mountains-3.png)
![out2](assets/stable-samples/img2img/mountains-2.png)
This procedure can, for example, also be used to upscale samples from the base model.
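One way to read the `--strength` value is as the fraction of the sampler's steps that are used to noise the input image before it is denoised again. The sketch below works under that assumption; the helper name `encode_steps` and the default of 50 sampler steps are illustrative choices.

```python
def encode_steps(strength, sampler_steps=50):
    """Number of noising steps applied to the input image.

    Assumes the noise level is chosen as strength * sampler_steps,
    truncated to an integer.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return int(strength * sampler_steps)


for s in (0.2, 0.5, 0.8, 1.0):
    print(s, encode_steps(s))
```

With a strength of 0.8 and 50 steps, the input is noised for 40 steps, so only a coarse impression of the original sketch survives.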
## Comments
- Our codebase for the diffusion models builds heavily on [OpenAI's ADM codebase](https://github.com/openai/guided-diffusion)

80
danbooru_data/download.py Normal file

@ -0,0 +1,80 @@
import os
import json
import requests
from multiprocessing.pool import ThreadPool
import tqdm

# downloads URLs from JSON

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--file', '-f', type=str, required=True)
parser.add_argument('--out_dir', '-o', type=str, required=True)
# type=int so that a value given on the command line is not passed on as a string
parser.add_argument('--threads', '-p', type=int, required=False, default=32)
args = parser.parse_args()

class DownloadManager():
    def __init__(self, max_threads=32):
        self.failed_downloads = []
        self.max_threads = max_threads

    # args = (link, metadata, out_img_dir, out_text_dir)
    def download(self, args):
        try:
            name = args[0].split('/')[-1]
            r = requests.get(args[0], stream=True)
            # treat HTTP errors as failures instead of saving the error page as an image
            r.raise_for_status()
            with open(os.path.join(args[2], name), 'wb') as f:
                for chunk in r.iter_content(1024):
                    f.write(chunk)
            with open(os.path.join(args[3], name.split('.')[0] + '.txt'), 'w') as f:
                f.write(args[1])
        except Exception:
            self.failed_downloads.append((args[0], args[1]))

    def download_urls(self, file_path, out_dir):
        with open(file_path) as f:
            data = json.load(f)

        # os.path.join works whether or not out_dir ends with a slash
        img_dir = os.path.join(out_dir, 'img')
        text_dir = os.path.join(out_dir, 'text')
        os.makedirs(img_dir, exist_ok=True)
        os.makedirs(text_dir, exist_ok=True)

        thread_args = []

        print(f'Loading {file_path} for download on {self.max_threads} threads...')

        # create initial thread_args
        for k, v in tqdm.tqdm(data.items()):
            thread_args.append((k, v, img_dir, text_dir))

        # divide thread_args into chunks of at most max_threads items
        chunks = []
        for i in range(0, len(thread_args), self.max_threads):
            chunks.append(thread_args[i:i+self.max_threads])

        print(f'Downloading {len(thread_args)} images...')

        # download chunks one after another; threads (rather than processes)
        # share self.failed_downloads, so failures are visible here afterwards
        with ThreadPool(self.max_threads) as p:
            for chunk in tqdm.tqdm(chunks):
                p.map(self.download, chunk)

        if len(self.failed_downloads) > 0:
            print("Failed downloads:")
            for i in self.failed_downloads:
                print(i[0])
            print("\n")
        """
        # attempt to download any remaining failed downloads
        print('\nAttempting to download any failed downloads...')
        print('Failed downloads:', len(self.failed_downloads))
        if len(self.failed_downloads) > 0:
            for url in tqdm.tqdm(self.failed_downloads):
                self.download((url[0], url[1], img_dir, text_dir))
        """

if __name__ == '__main__':
    dm = DownloadManager(max_threads=args.threads)
    dm.download_urls(args.file, args.out_dir)

50
danbooru_data/scrape.py Normal file

@ -0,0 +1,50 @@
import threading
import requests
import json
import random
from pybooru import Danbooru
from tqdm import tqdm

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--danbooru_username', '-user', type=str, required=False)
parser.add_argument('--danbooru_key', '-key', type=str, required=False)
parser.add_argument('--tags', '-t', required=False, default="solo -comic -animated -touhou -rating:general order:score age:<1month")
# type=int so that a value given on the command line can be used in arithmetic
parser.add_argument('--posts', '-p', type=int, required=False, default=10000)
parser.add_argument('--output', '-o', required=False, default='links.json')
args = parser.parse_args()

class DanbooruScraper():
    def __init__(self, username, key):
        self.username = username
        self.key = key
        self.dbclient = Danbooru('danbooru', username=self.username, api_key=self.key)

    # This will get danbooru urls and tags, put them in a dict, then write as a json file
    def get_urls(self, tags, num_posts, batch_size, file="data_urls.json"):
        # maps file_url -> tag string (renamed from `dict` to avoid shadowing the builtin)
        url_tags = {}

        if num_posts % batch_size != 0:
            print("Error: num_posts must be divisible by batch_size")
            return

        for i in tqdm(range(num_posts//batch_size)):
            urls = self.dbclient.post_list(tags=tags, limit=batch_size, random=False, page=i)
            if not urls:
                print(f'Empty results at {i}')
                break
            for j in urls:
                if 'file_url' in j:
                    if j['file_url'] not in url_tags:
                        d_url = j['file_url']
                        d_tags = j['tag_string_copyright'] + " " + j['tag_string_character'] + " " + j['tag_string_general'] + " " + j['tag_string_artist']
                        url_tags[d_url] = d_tags
                else:
                    print("Error: file_url not found")

        with open(file, 'w') as f:
            json.dump(url_tags, f)

# now test
if __name__ == "__main__":
    ds = DanbooruScraper(args.danbooru_username, args.danbooru_key)
    ds.get_urls(args.tags, args.posts, 100, file=args.output)


@ -2,4 +2,6 @@
Waifu Diffusion is a project based off CompVis/Stable-Diffusion.
For guidance on how to start training, see [training](https://github.com/harubaru/waifu-diffusion/tree/main/docs/en/training).
For guidance on how to start training, see [training](./training/README.md).
For a list of trained weights, see [weights](./weights/README.md).


@ -1,8 +1,8 @@
# Training documentation
Training is available with waifu-diffusion. Before starting, note that at least 30GB of VRAM is currently needed, along with at least 30GB of storage if you don't mind cleaning up every so often.
## Contents
1. [Dataset](https://github.com/harubaru/waifu-diffusion/blob/main/docs/en/training/dataset.md)
2. [Configuration](https://github.com/harubaru/waifu-diffusion/blob/main/docs/en/training/configuration.md)
1. [Dataset](./dataset.md)
2. [Configuration](./configuration.md)
3. Executing
4. Recommendations
5. FAQ
5. FAQ


@ -9,9 +9,13 @@ In this guide we are going to use the Danbooru2021 dataset by Gwern.net. You are
4. Packaging the dataset
## Dataset requirements
The dataset needs to be in the following format
/dataset/ : Root dataset folder, can be any name
/dataset/img/ : Folder for images
/dataset/txt/ : Folder for text files
It is recommended to have the images in 512x512 resolution and in JPG format. Each text file needs to have the same base name as the image it refers to.
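Before training, it can help to check that every image has a matching text file. The following is a minimal sketch of such a check, assuming the `img` and `txt` folder names from the layout above; the function name `unpaired_files` is made up for this example and is not part of the repository.

```python
from pathlib import Path


def unpaired_files(root):
    """Return (images without a text file, text files without an image).

    Files are matched by base name, so img/0001.jpg pairs with txt/0001.txt.
    """
    root = Path(root)
    images = {p.stem for p in (root / "img").iterdir() if p.is_file()}
    texts = {p.stem for p in (root / "txt").iterdir() if p.is_file()}
    return sorted(images - texts), sorted(texts - images)
```

Calling `unpaired_files("/dataset")` returns two lists of base names; both should be empty for a consistent dataset.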
@ -38,23 +42,35 @@ apt install rsync
````
#### Windows
On Windows, you will need to install Cygwin, a POSIX runtime for Windows that allows many Linux-only programs to be used on Windows.
[Cygwin Installer for x86_64](https://www.cygwin.com/setup-x86_64.exe)
On the installer, select mirrors.kernel.org for Download Site:
![[cygwin-mirrors.png]]
![cygwin-mirrors.png](./res/cygwin-mirrors.png)
Next, search for "rsync" in the search bar, change "View: Pending" to "View: Full", and select the latest version under "New". Do the same for "zip".
![[cygwin-packages.png]]
![cygwin-packages.png](./res/cygwin-packages.png)
GIF explaining the entire process:
![[cygwin-gif.gif]]
![cygwin-gif.gif](./res/cygwin-gif.gif)
Once the installation is finished, you should see "Cygwin64 Terminal" in your Start Menu. Launch it and you should be greeted by the following window:
![[cygwin-idle.png]]
![cygwin-idle.png](./res/cygwin-idle.png)
You may now follow the instructions.
### Downloading the dataset
Remember that the instructions here apply to both Linux and Windows (provided you are using Cygwin).
The entire dataset weighs about 5TB. You are not going to download everything; instead, you will only download two kinds of files:
1. The images
2. The JSON files (metadata)
If you want to see the entire file list, you can refer to the [Danbooru2021 information site](https://www.gwern.net/Danbooru2021).
We are going to extract the images from the 512px folder for convenience, since this folder already has the images resized to 512x512 resolution in JPG format. It only has safe-rated images; for NSFW, refer to [gwern.net](https://www.gwern.net/Danbooru2021#samples).
@ -85,7 +101,8 @@ Change "/waifu-diffusion" to the path of the cloned waifu-diffusion repository.
This script will also change some tags such as "1girl" to "one girl", "2boys" to "two boys", and so on. It will also add "uploaded on Danbooru".
Once the script has finished, you should have a "labeled_data" folder, whose insides look like this:
![[labeled_data-insides.png]]
![labeled_data-insides.png](./res/labeled_data-insides.png)
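The tag rewriting described above ("1girl" to "one girl", "2boys" to "two boys") could look roughly like the sketch below. This is not the repository's labeling script; the function name `rewrite_count_tag`, the regular expression, and the range of counts covered are all assumptions made for illustration.

```python
import re

# Hypothetical mapping; which counts the real script handles is not stated.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five", "6": "six"}


def rewrite_count_tag(tag):
    """Turn count tags such as "1girl" into "one girl"; leave other tags unchanged."""
    match = re.fullmatch(r"([1-6])(girl|boy)(s?)", tag)
    if match is None:
        return tag
    count, noun, plural = match.groups()
    return f"{NUMBER_WORDS[count]} {noun}{plural}"


print([rewrite_count_tag(t) for t in ["1girl", "2boys", "solo"]])
# ['one girl', 'two boys', 'solo']
```

Spelling the counts out as words presumably brings the captions closer to the natural-language text the CLIP text encoder was trained on.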
## Packaging the dataset
In order to reduce size, zip the contents of labeled_data:

15
docs/en/weights/README.md Normal file

@ -0,0 +1,15 @@
# Weights
The following is a small list of available weights released by the Waifu Diffusion project:
- Waifu Diffusion v1.2
Release Date: 07/09/2022
Steps/Epochs/Images: 5 Epochs, 56,000 Images
Download: [Mirrors](./danbooru-7-09-2022/README.md)
License: None
Authors: Haru (haru#1367@discord)


@ -0,0 +1,19 @@
Waifu Diffusion v1.2
Release Date: 07/09/2022
Steps/Epochs/Images: 5 Epochs, 56,000 Images
License: None
Authors: Haru (haru#1367@discord)
Mirrors:
Google Drive (rate limit): https://drive.google.com/file/d/1XeoFCILTcc9kn_5uS-G0uqWS5XVANpha
Magnet Link: magnet:?xt=urn:btih:INEYUMLLBBMZF22IIP4AEXLUK6XQKCSD&dn=wd-v1-2-full-ema.ckpt&xl=7703810927&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
HTTPS mirror: https://thisanimedoesnotexist.ai/downloads/wd-v1-2-full-ema.ckpt (Fastest)
HTTP mirror: http://wd.links.sd:8880/wd-v1-2-full-ema.ckpt