big update, adding auto-captioning

Victor Hall 2022-10-30 21:59:26 -04:00
parent 62ddb83042
commit 814440c288
14 changed files with 390 additions and 47 deletions

6
.gitignore vendored

@@ -2,6 +2,12 @@
/everydream-venv/**
/laion/*.parquet
/output/**
/.cache/**
/.venv/**
/input/*.jpg
/input/*.webp
/input/*.png
/scripts/BLIP
# Byte-compiled / optimized / DLL files
__pycache__/


@@ -4,60 +4,51 @@ This repo will contain tools for data engineering efforts for people interested
For instance, by using ground truth Laion data mixed in with training data to replace "regularization" images, together with clip-interrogated captioning or the original TEXT caption from laion, the final few concepts left of the original DreamBooth paper will have been removed. This is a significant step towards full fine tuning capabilities.
Captioned training together with regularization has enabled multi-subject and multi-style training at the same time, and can scale to larger training efforts.
For example, you can download a large scale model for Final Fantasy 7 Remake here: https://huggingface.co/panopstor/ff7r-stable-diffusion and be sure to also follow up on the gist link at the bottom for more information, along with links to example output of a multi-model fine tuning.
Since DreamBooth is now fading away in favor of improved techniques, I will call the technique of using fully captioned training together with ground truth data "EveryDream" to avoid confusion.
If you are interested in caption training with stable diffusion and general purpose fine tuning, and have a 24GB Nvidia GPU, you can try my trainer fork:
https://github.com/victorchall/EveryDream-trainer (currently a bit beta but working)
Join the EveryDream discord here: https://discord.gg/uheqxU6sXN
## Tools
[Download scrapes using Laion](./doc/LAION_SCRAPE.md) - Web scrapes images off the web using Laion data files.
[Auto Captioning](./doc/AUTO_CAPTION.md) - Uses BLIP interrogation to caption images for training.
## Install
You can use conda or venv. This was developed on Python 3.10.5 but may work on older or newer versions.
One step venv setup:
create_venv.bat
Don't forget to activate every time you open the command prompt later.
activate_venv.bat
To use conda:
conda env create -f environment.yaml
conda activate everydream
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/salesforce/BLIP scripts/BLIP
Or, if you wish to configure your own venv, container/WSL, or Linux:
pip install -r requirements.txt
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/salesforce/BLIP scripts/BLIP
Thanks to the SalesForce team for the BLIP tool. It uses CLIP to produce sane sentences like you would expect to see in alt-text.


@@ -1 +1 @@
call .venv/scripts/activate.bat


@@ -1 +1,15 @@
python -m venv .venv
call .venv/scripts/activate.bat
if %errorlevel% neq 0 goto :error
pip install -r requirements.txt
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/salesforce/BLIP scripts/BLIP
if %errorlevel% neq 0 goto :error
goto :done
:error
echo Error occurred trying to install or activate venv.
exit /b %errorlevel%
:done

1
deactivate_venv.bat Normal file

@@ -0,0 +1 @@
call .venv/scripts/deactivate.bat

BIN
demo/beam_min_vs_q.webp Normal file

Size: 119 KiB

BIN
demo/beam_vs_nucleus.webp Normal file

Size: 74 KiB

BIN
demo/beam_vs_nucleus_2.webp Normal file

Size: 121 KiB

95
doc/AUTO_CAPTION.md Normal file

@@ -0,0 +1,95 @@
# Automatic captioning
Automatic captioning uses Salesforce's BLIP to automatically create a clean sentence structure for captioning input images before training.
This requires an Nvidia GPU with about 860MB of available VRAM. It should run fine on something like a 1050 2GB.
Images should be **square** (1:1 H:W ratio), but they can be any size. I suggest using [Birme](https://www.birme.net/?target_width=512&target_height=512&auto_focal=false&image_format=webp&quality_jpeg=95&quality_webp=99) to crop and resize first, but there are various tools out there for this. I strongly suggest making sure to crop well for training!
Auto-captioning is fast and not very resource intensive, but it still uses the GPU; an Nvidia GPU with 2GB VRAM is enough.
Make sure the CUDA build of torch and torchvision is installed by activating your environment and running this command:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
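To double-check that the CUDA build of torch is the one actually in use, a quick sanity check from the activated environment (not part of the repo scripts) is:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
On a working setup this should print a version ending in +cu113 and True.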
## Execute
Place input files into the /input folder
python scripts/auto_caption.py
Files will be **copied**, renamed so that the caption becomes the file name, and placed into /output.
## Additional command line args:
### --img_dir
Changes the default input directory to read for files. Default is /input
python scripts/auto_caption.py --img_dir x:/data/my_cropped_images
### --out_dir
Changes the default output directory. Default is /output
python scripts/auto_caption.py --out_dir x:/data/ready_to_train
### --format
"filename" or "mrwho"
"filename" will simply name the file the caption .EXT and, if needed, add _n at the end to avoid collisions, for use with EveryDream trainer or Kane Wallmann's dream booth fork. This is the default behavior if --format is not set.
"mrwho" will add \[number\]@ as a prefix for use with MrWho's captioning system (ex. JoePenna dream both fork) which uses that naming standard to avoid file name collisions.
python scripts/auto_caption.py --format "mrwho"
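As a hypothetical example, if BLIP produces the caption "a dog wearing sunglasses" for an input photo, "filename" format would write a dog wearing sunglasses.jpg (adding _1, _2, etc. on collisions), while "mrwho" format would write something like 00001@a dog wearing sunglasses.jpg, where 00001 is the running image number.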
## Tweaks
You may find the following settings useful for dealing with bad auto-captions. Start with the defaults, and if captions seem inaccurate or repetitious, try some of the following settings.
### --nucleus
Uses an alternative "nucleus" sampling algorithm instead of the default "beam 16" algorithm. Nucleus produces relatively short captions that are reliably free of repeated words and phrases, comparable to beam 16, which can be adjusted further but may need more tweaking. With nucleus, --q_factor values of 0.3 to 3 seem to produce sensible prompts.
python scripts/auto_caption.py --nucleus
![Beam vs Nucleus](../demo/beam_vs_nucleus.webp)
Additional captions for above with nucleus:
nucleus q_factor 9999: *"a number of kites painted in different colors in a ceiling"*
nucleus q_factor 200: *"a group of people waiting under art hanging from a ceiling"*
nucleus q_factor 0.8: *"several people standing around with large colorful umbrellas"*
nucleus q_factor 0.01: *"people are standing in an open building with colorful paper decorations"*
nucleus q_factor 0.00001: (same as above)
### --q_factor
An adjustment for whichever algorithm is used.
For the default beam 16 algorithm, it limits the ability of words and phrases to be repeated. A higher value reduces repeated words and phrases. 0.6 to 1.3 are sensible values for beam 16. The default is 0.8 and works well with the default min_length of 24. Consider using higher values if you use a min_length higher than 24 with beam 16.
For nucleus (--nucleus), it simply changes the character of the caption and does not affect repetition. Values ranging from 0.01 to 200 seem sensible, and the default of 0.8 usually works well.
![Beam vs Nucleus](../demo/beam_vs_nucleus_2.webp)
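For reference, this is roughly how the two modes map onto BLIP's generate() call inside scripts/auto_caption.py (a simplified excerpt of the script added in this commit):
if opt.nucleus:
    # nucleus sampling: --q_factor is passed as top_p
    captions = blip_decoder.generate(image, sample=True, top_p=opt.q_factor)
else:
    # beam search with 16 beams: --q_factor is passed as repetition_penalty,
    # and --min_length / max_length (48) bound the caption length in tokens
    captions = blip_decoder.generate(image, sample=False, num_beams=16,
        min_length=opt.min_length, max_length=48, repetition_penalty=opt.q_factor)
caption = captions[0]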
### --min_length
Adjusts the minimum length of the caption, measured in tokens. **Only applies to beam 16.** Useful to adjust along with --q_factor to keep the caption from repeating.
Default is 24. Sensible values are 15 to 30; the max is 48. Larger values are much more prone to repeating phrases and should be accompanied by a higher --q_factor to avoid repeats.
python scripts/auto_caption.py --min_length 20
![Q vs Min for beam](../demo/beam_min_vs_q.webp)
If you continue to increase both min_length and q_factor, you start to get oddly specific prompts. For example, using the above image:
--q_factor 1.9 --min_length 48:
*"a painting of a group of people sitting at a table in a room with red drapes on the walls and gold trimmings on the ceiling, while one person is holding a wine glass in front of the other hand"*

52
doc/LAION_SCRAPE.md Normal file

@@ -0,0 +1,52 @@
# download_laion.py
![](../demo/demo03.png)
This script enables you to webscrape using the Laion parquet files which are available on Huggingface.co.
It has been tested with 2B-en-aesthetic, but may need minor tweaks for some other datasets that contain different columns. Keep in mind some other files are purely sidecar metadata.
https://huggingface.co/datasets/laion/laion2B-en-aesthetic
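If you want to peek inside a parquet file before scraping, pandas and pyarrow (both in requirements.txt) can read it directly. A minimal sketch, assuming a hypothetical file name and the dataset's usual URL and TEXT columns:
import pandas as pd
df = pd.read_parquet("laion/part-00000.parquet")  # hypothetical file name
print(df.columns)  # expect URL, TEXT, and assorted metadata columns
print(df[df["TEXT"].str.contains("a man", case=False, na=False)].head())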
The script will rename downloaded files, to the best of its ability, to the TEXT (caption) of the image, keeping the original file extension, so they can be plugged into the new class of caption-capable DreamBooth apps or the EveryDream trainer, which use the filename as the prompt for training.
One suggested use is to take this data and replace regularization images with ground truth data from the Laion dataset.
It should execute quite quickly as it uses async task gathers for the HTTP and file I/O work.
Default folders are /laion for the parquet files and /output for downloaded images, relative to the repo root, but consider disk space and point to another location if needed.
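The general shape of the async task-gather download described above looks something like this (a simplified illustration using aiohttp and aiofiles, not the script's exact code):
import asyncio
import aiofiles
import aiohttp

async def download_one(session, url, out_path):
    # fetch one image and write it to disk without blocking the event loop
    async with session.get(url) as resp:
        if resp.status == 200:
            data = await resp.read()
            async with aiofiles.open(out_path, "wb") as f:
                await f.write(data)

async def download_all(items):
    # items: (url, output path) pairs built from the parquet URL/TEXT columns
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download_one(session, url, path) for url, path in items))

# asyncio.run(download_all(list_of_url_path_pairs))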
## Examples
Query all the parquet files in ./laion for any image with a caption (TEXT) containing "a man" and attempt to stop after downloading (approximately) 50 files:
python scripts/download_laion.py --search_text "a man" --limit 50
Query for " person " with a leading and trailing space (spaces are not stripped):
python scripts/download_laion.py --search_text " person " --limit 200
Query for both "man" and "photo" anywhere in the caption, and write them to z:/myDumpFolder instead of the default folder. Useful if you need to put them on another drive, NAS, etc. The default limit of 100 images will apply since --limit is omitted:
python scripts/download_laion.py --search_text "man,photo" --out_dir "z:/myDumpFolder" --laion_dir "x:/datahoard/laion5b"
## Performance
Script should be reasonably fast depending on your internet speed. I'm able to pull 10,000 images in about 3 1/2 minutes on 1 Gbit fiber.
## Other resources
Easy resize/crop tool: [Birme](https://www.birme.net/?target_width=512&target_height=512&auto_focal=false&image_format=webp&quality_jpeg=95&quality_webp=99)
Nvidia has compiled a close up photo set: [ffhq-dataset](https://github.com/NVlabs/ffhq-dataset)
## Batch run
You can put commands like these in a shell/cmd script to run several searches, but I will leave that exercise to the user:
python scripts/download_laion.py --search_text "jan van eyck" --limit 200
python scripts/download_laion.py --search_text " hokusai" --limit 200
python scripts/download_laion.py --search_text " bernini" --limit 200
python scripts/download_laion.py --search_text "Gustav Klimt" --limit 200
python scripts/download_laion.py --search_text "engon Schiele" --limit 200


@@ -1,7 +1,10 @@
name: everydream
dependencies:
- pandas>=1.4.3
- pyarrow>=9.0.0
- aiofiles>=22.1.0
- colorama>=0.4.5
- aiohttp>=3.8.3
#- open_clip_torch>=1.26.12
- timm
- fairscale==0.4.4
- transformers==4.19.2


@@ -3,3 +3,7 @@ pyarrow>=9.0.0
aiofiles>=22.1.0
colorama>=0.4.5
aiohttp>=3.8.3
#open_clip_torch>=1.26.12
timm
fairscale==0.4.4
transformers==4.19.2

179
scripts/auto_caption.py Normal file

@@ -0,0 +1,179 @@
import argparse
import glob
import os
from PIL import Image
import sys
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
import torch
import aiohttp
import asyncio
SIZE = 384
def get_parser(**parser_kwargs):
parser = argparse.ArgumentParser(**parser_kwargs)
parser.add_argument(
"--img_dir",
type=str,
nargs="?",
const=True,
default="input",
help="directory with images to be captioned",
),
parser.add_argument(
"--out_dir",
type=str,
nargs="?",
const=True,
default="output",
help="directory to put captioned images",
),
parser.add_argument(
"--format",
type=str,
nargs="?",
const=True,
default="filename",
help="'filename', 'json', or 'parquet'",
),
parser.add_argument(
"--nucleus",
type=bool,
nargs="?",
const=True,
default=False,
help="use nucleus sampling instead of beam",
),
parser.add_argument(
"--q_factor",
type=float,
nargs="?",
const=True,
default=0.8,
help="adjusts the likelihood of a word being repeated",
),
parser.add_argument(
"--min_length",
type=int,
nargs="?",
const=True,
default=24,
help="adjusts the likelihood of a word being repeated",
)
return parser
def load_image(raw_image, device):
transform = transforms.Compose([
#transforms.CenterCrop(SIZE),
transforms.Resize((SIZE, SIZE), interpolation=InterpolationMode.BICUBIC),
transforms.ToTensor(),
transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])
image = transform(raw_image).unsqueeze(0).to(device)
return image
async def main(opt):
print("starting")
import models.blip
sample = False
if opt.nucleus:
sample = True
input_dir = os.path.join(os.getcwd(), opt.img_dir)
print("input_dir: ", input_dir)
config_path = os.path.join(os.getcwd(), "scripts/BLIP/configs/med_config.json")
model_cache_path = ".cache/model_base_caption_capfilt_large.pth"
model_path = os.path.join(os.getcwd(), model_cache_path)
os.makedirs(os.path.dirname(model_path), exist_ok=True)  # ensure the .cache folder exists before downloading
if not os.path.exists(model_path):
print(f"Downloading model to {model_path}... please wait")
blip_model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth'
async with aiohttp.ClientSession() as session:
async with session.get(blip_model_url) as res:
result = await res.read()
with open(model_path, 'wb') as f:
f.write(result)
print(f"Model cached to: {model_path}")
else:
print(f"Model already cached to: {model_path}")
blip_decoder = models.blip.blip_decoder(pretrained=model_path, image_size=384, vit='base', med_config=config_path)
blip_decoder.eval()
print("loading model to cuda")
blip_decoder = blip_decoder.to(torch.device("cuda"))
ext = ('.jpg', '.jpeg', '.png', '.webp', '.tif', '.tga', '.tiff', '.bmp', '.gif')
i = 0
for idx, img_file_name in enumerate(glob.iglob(os.path.join(opt.img_dir, "*.*"))):
if img_file_name.endswith(ext):
caption = None
file_ext = os.path.splitext(img_file_name)[1]
if (file_ext in ext):
with open(img_file_name, "rb") as input_file:
print("working image: ", img_file_name)
image = Image.open(input_file)
image = load_image(image, device=torch.device("cuda"))
if opt.nucleus:
captions = blip_decoder.generate(image, sample=True, top_p=opt.q_factor)
else:
captions = blip_decoder.generate(image, sample=sample, num_beams=16, min_length=opt.min_length, \
max_length=48, repetition_penalty=opt.q_factor)
caption = captions[0]
input_file.seek(0)
data = input_file.read()
input_file.close()
if opt.format in ["mrwho","joepenna"]:
prefix = f"{i:05}@"
i += 1
caption = prefix+caption
out_file = os.path.join(opt.out_dir, f"{caption}{file_ext}")
print(" out_file:", out_file)
print()
if opt.format in ["filename","mrwho"]:
#out_file = os.path.join(out_file)
with open(out_file, "wb") as out_file:
out_file.write(data)
elif opt.format == "json":
raise NotImplementedError
elif opt.format == "parquet":
raise NotImplementedError
def isWindows():
return sys.platform.startswith("win")
if __name__ == "__main__":
print("starting")
parser = get_parser()
opt = parser.parse_args()
if opt.format not in ["filename", "json", "mrwho", "joepenna", "parquet"]:
raise ValueError("format must be 'filename', 'mrwho', 'joepenna', 'json', or 'parquet'")
if (isWindows()):
print("Windows detected, using asyncio.WindowsSelectorEventLoopPolicy")
asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
else:
print("Unix detected, using default asyncio event loop policy")
blip_path = os.path.join(os.getcwd(), "scripts/BLIP")
sys.path.append(blip_path)
asyncio.run(main(opt))


@@ -1,7 +1,5 @@
import sys
import os
import pandas as pd
import pyarrow as pa
import argparse