big update, adding auto-captioning

Victor Hall 2022-10-30 21:59:26 -04:00
parent 62ddb83042
commit 814440c288
14 changed files with 390 additions and 47 deletions

.gitignore (vendored, 6 changes)

@@ -2,6 +2,12 @@
/everydream-venv/**
/laion/*.parquet
/output/**
/.cache/**
/.venv/**
/input/*.jpg
/input/*.webp
/input/*.png
/scripts/BLIP
# Byte-compiled / optimized / DLL files
__pycache__/

README.md

@@ -4,60 +4,51 @@ This repo will contain tools for data engineering efforts for people interested
For instance, by using ground truth Laion data mixed in with training data to replace "regularization" images, together with CLIP-interrogated captioning or the original TEXT caption from Laion, the final few concepts left over from the original DreamBooth paper will have been removed. This is a significant step towards full fine tuning capabilities.
Captioned training together with regularization has enabled multi-subject and multi-style training at the same time without
Captioned training together with regularization has enabled multi-subject and multi-style training at the same time, and can scale to larger training efforts.
You can download a large scale model for Final Fantasy 7 Remake here: https://huggingface.co/panopstor/ff7r-stable-diffusion and be sure to also follow up on the gist link at the bottom for more information along with links to example output of a multi-model fine tuning.
For example, you can download a large scale model for Final Fantasy 7 Remake here: https://huggingface.co/panopstor/ff7r-stable-diffusion and be sure to also follow up on the gist link at the bottom for more information along with links to example output of a multi-model fine tuning.
Since DreamBooth is now fading away in favor of improved techniques, I will call the technique of using fully captioned training together with ground truth data "EveryDream" to avoid confusion.
If you are interested in caption training with stable diffusion and have a 24GB Nvidia GPU I suggest trying this repo out:
https://github.com/victorchall/EveryDream-trainer (currently alpha but working)
If you are interested in caption training with stable diffusion and general purpose fine tuning, and have a 24GB Nvidia GPU, you can try my trainer fork:
https://github.com/victorchall/EveryDream-trainer (currently a bit beta but working)
Join the EveryDream discord here: https://discord.gg/uheqxU6sXN
## Tools
[Download scrapes using Laion](./doc/LAION_SCRAPE.md) - Web scrapes images off the web using Laion data files.
[Auto Captioning](./doc/AUTO_CAPTION.md) - Uses BLIP interrogation to caption images for training.
## Install
Automatic venv setup scripts for Linux and Windows are a work in progress. You can create one yourself or create a conda environment with the environment.yaml, and I suggest you do so to avoid dependency conflicts. This repo mainly uses aiohttp, aiofile, and pandas for the time being but expect other packages to be added in the future.
You can use conda or venv. This was developed on Python 3.10.5 but may work on older or newer versions.
One step venv setup:
create_venv.bat
Don't forget to activate every time you open the command prompt later.
activate_venv.bat
To use conda:
conda env create -f environment.yaml
Or you can configure your own venv or container, or just use your local Python:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/salesforce/BLIP scripts/BLIP
conda activate everydream
Or, if you wish to configure your own venv, container/WSL, or Linux:
pip install -r requirements.txt
## download_laion.py
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
![](demo/demo03.png)
git clone https://github.com/salesforce/BLIP scripts/BLIP
This script enables you to webscrape using the Laion parquet files which are available on Huggingface.co.
It has been tested with 2B-en-aesthetic, but may need minor tweaks for some other datasets that contain different columns.
https://huggingface.co/datasets/laion/laion2B-en-aesthetic
It will rename downloaded files, to the best of its ability, to the TEXT (caption) of the image, keeping the original file extension, so they can be plugged into the new class of caption-capable DreamBooth apps that use the filename as the prompt for training.
One suggested use is to take this data and replace regularization images with ground truth data from the Laion dataset.
It should execute quite quickly as it uses async task gathers for the HTTP and file I/O work.
Default folders are /laion for the parquet files and /output for downloaded images relative to the root folder, but consider disk space and point to another location if needed.
ex. Query all the parquet files in ./laion for any image with a caption (TEXT) containing "a man" and attempt to stop after downloading (approximately) 50 files:
python scripts/download_laion.py --search_text "a man" --limit 50
Query for person with a leading and trailing space, as they are not stripped:
python scripts/download_laion.py --search_text " person " --limit 200
Query for both "man" and "photo" anywhere in the caption, and write them to z:/myDumpFolder instead of the default folder. Useful if you need to put them on another drive, NAS, etc. The default limit of 100 images will apply since --limit is omitted:
python scripts/download_laion.py --search_text "man,photo" --out_dir "z:/myDumpFolder" --laion_dir "x:/datahoard/laion5b"
![](demo/demo02.png)
## Other resources
Nvidia has compiled a close up photo set here: https://github.com/NVlabs/ffhq-dataset
Thanks to the SalesForce team for the BLIP tool. It uses CLIP to produce sane sentences like you would expect to see in alt-text.

activate_venv.bat

@@ -1 +1 @@
call everydream-venv/scripts/activate.bat
call .venv/scripts/activate.bat

create_venv.bat

@@ -1 +1,15 @@
python -m venv ./everydream-venv
python -m venv .venv
call .venv/scripts/activate.bat
if %errorlevel% neq 0 goto :error
pip install -r requirements.txt
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone https://github.com/salesforce/BLIP scripts/BLIP
if %errorlevel% neq 0 goto :error
goto :done
:error
echo Error occurred trying to install or activate venv.
exit /b %errorlevel%
:done

deactivate_venv.bat (new file, 1 line)

@@ -0,0 +1 @@
call .venv/scripts/deactivate.bat

demo/beam_min_vs_q.webp (new binary file, 119 KiB; not shown)

demo/beam_vs_nucleus.webp (new binary file, 74 KiB; not shown)

demo/beam_vs_nucleus_2.webp (new binary file, 121 KiB; not shown)

doc/AUTO_CAPTION.md (new file, 95 lines)

@@ -0,0 +1,95 @@
# Automatic captioning
Automatic captioning uses Salesforce's BLIP to automatically create a clean sentence structure for captioning input images before training.
This requires an Nvidia GPU with about 860MB of available VRAM. It should run fine on something like a 1050 2GB.
Images should be **square** (1:1 H:W ratio), but they can be any size. I suggest using [Birme](https://www.birme.net/?target_width=512&target_height=512&auto_focal=false&image_format=webp&quality_jpeg=95&quality_webp=99) to crop and resize first, but there are various tools out there for this. I strongly suggest making sure to crop well for training!
Auto-caption is fast and not very resource intensive, but it still uses the GPU. You only need an Nvidia GPU with 2GB of VRAM to run it.
Make sure the CUDA builds of torch and torchvision are installed by activating your environment and running this command:
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
## Execute
Place input files into the /input folder
python scripts/auto_caption.py
Files will be **copied** and renamed to the caption as the file name and placed into /output.
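For example (the caption below is hypothetical), an input like `input/IMG_0042.webp` might be copied to `output/a man in a suit standing next to a motorcycle.webp`.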
## Additional command line args:
### --img_dir
Changes the default input directory to read for files. Default is /input
python scripts/auto_caption.py --img_dir x:/data/my_cropped_images
### --out_dir
Changes the default output directory. Default is /output
python scripts/auto_caption.py --out_dir x:/data/ready_to_train
### --format
"filename" or "mrwho"
"filename" will simply name the file the caption .EXT and, if needed, add _n at the end to avoid collisions, for use with EveryDream trainer or Kane Wallmann's dream booth fork. This is the default behavior if --format is not set.
"mrwho" will add \[number\]@ as a prefix for use with MrWho's captioning system (ex. JoePenna dream both fork) which uses that naming standard to avoid file name collisions.
python scripts/auto_caption.py --format "mrwho"
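As an illustration (the caption here is hypothetical), "filename" would save a file such as `a dog catching a frisbee.jpg`, while "mrwho" would save `00000@a dog catching a frisbee.jpg`, with the five-digit prefix incrementing for each image processed.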
## Tweaks
You may find the following settings useful for dealing with bad auto-captioning. Start with the defaults, and if you get captions that seem inaccurate or repetitious, try some of the following adjustments.
### --nucleus
Uses an alternative "nucleus" algorithm instead of the default "beam 16" algorithm. Nucleus produces relatively short captions but reliably absent of repeated words and phrases, comparable to using beam 16 which can be adjusted further but may need more tweaking. 0.3 to 3 seem to produce sensible prompts.
python scripts/auto_caption.py --nucleus
![Beam vs Nucleus](../demo/beam_vs_nucleus.webp)
Additional captions for above with nucleus:
nucleus q_factor 9999: *"a number of kites painted in different colors in a ceiling"*
nucleus q_factor 200: *"a group of people waiting under art hanging from a ceiling"*
nucleus q_factor 0.8: *"several people standing around with large colorful umbrellas"*
nucleus q_factor 0.01: *"people are standing in an open building with colorful paper decorations"*
nucleus q_factor 0.00001: (same as above)
### --q_factor
An adjustment for the algorithm used.
For the default beam 16 algorithm it limits the ability of words and phrases to be repeated. A higher value reduces repeated words and phrases. 0.6-1.3 are sensible values for beam 16. The default is 0.8 and works well with the default min_length of 24. Consider using higher values if you use a min_length higher than 24 with beam 16.
For nucleus (--nucleus), it simply changes the opinion on the prompt and does not impact repeats. Values ranging from 0.01 to 200 seem sensible and default of 0.8 usually works well.
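For example, to raise the repetition penalty for beam search a bit above the default of 0.8 (the exact value here is just an illustration):
python scripts/auto_caption.py --q_factor 1.3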
![Beam vs Nucleus](../demo/beam_vs_nucleus_2.webp)
### --min_length
Adjusts the minimum length of prompt, measured in tokens. **Only applies to beam 16.** Useful to adjust along with --q_factor to keep it from repeating.
Default is 24. Sensible values are 15 to 30, max is 48. Larger values are much more prone to repeating phrases and should be accompanied by increasing --q_factor to avoid repeats.
python scripts/auto_caption.py --min_length 20
![Q vs Min for beam](../demo/beam_min_vs_q.webp)
If you continue to increase both min_length and q_factor, you start to get oddly specific prompts. For example, using the above image:
--q_factor 1.9 --min_length 48:
*"a painting of a group of people sitting at a table in a room with red drapes on the walls and gold trimmings on the ceiling, while one person is holding a wine glass in front of the other hand"*

doc/LAION_SCRAPE.md (new file, 52 lines)

@@ -0,0 +1,52 @@
# download_laion.py
![](../demo/demo03.png)
This script enables you to webscrape using the Laion parquet files which are available on Huggingface.co.
It has been tested with 2B-en-aesthetic, but may need minor tweaks for some other datasets that contain different columns. Keep in mind some other files are purely sidecar metadata.
https://huggingface.co/datasets/laion/laion2B-en-aesthetic
The script will rename downloaded files, to the best of its ability, to the TEXT (caption) of the image, keeping the original file extension, so they can be plugged into the new class of caption-capable DreamBooth apps or the EveryDream trainer, which use the filename as the prompt for training.
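For instance, if an image's TEXT caption in the parquet data were "a man walking on the beach" (a hypothetical example) and the source file were a .jpg, it would be saved as `a man walking on the beach.jpg`.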
One suggested use is to take this data and replace regularization images with ground truth data from the Laion dataset.
It should execute quite quickly as it uses async task gathers for the HTTP and file I/O work.
Default folders are /laion for the parquet files and /output for downloaded images relative to the root folder, but consider disk space and point to another location if needed.
## Examples
Query all the parquet files in ./laion for any image with a caption (TEXT) containing "a man" and attempt to stop after downloading (approximately) 50 files:
python scripts/download_laion.py --search_text "a man" --limit 50
Query for " person " with a leading and trailing space, as the spaces are not stripped:
python scripts/download_laion.py --search_text " person " --limit 200
Query for both "man" and "photo" anywhere in the caption, and write them to z:/myDumpFolder instead of the default folder. Useful if you need to put them on another drive, NAS, etc. The default limit of 100 images will apply since --limit is omitted:
python scripts/download_laion.py --search_text "man,photo" --out_dir "z:/myDumpFolder" --laion_dir "x:/datahoard/laion5b"
## Performance
Script should be reasonably fast depending on your internet speed. I'm able to pull 10,000 images in about 3 1/2 minutes on 1 Gbit fiber.
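That works out to roughly 10,000 images / 210 seconds, or about 48 images per second; actual throughput will depend on your connection and how quickly the source hosts respond.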
## Other resources
Easy resize/crop tool: [Birme](https://www.birme.net/?target_width=512&target_height=512&auto_focal=false&image_format=webp&quality_jpeg=95&quality_webp=99)
Nvidia has compiled a close up photo set: [ffhq-dataset](https://github.com/NVlabs/ffhq-dataset)
## Batch run
You can throw commands in a shell/cmd script to run several searches, though I will leave this exercise to the user:
python scripts/download_laion.py --search_text "jan van eyck" --limit 200
python scripts/download_laion.py --search_text " hokusai" --limit 200
python scripts/download_laion.py --search_text " bernini" --limit 200
python scripts/download_laion.py --search_text "Gustav Klimt" --limit 200
python scripts/download_laion.py --search_text "engon Schiele" --limit 200

environment.yaml

@@ -1,7 +1,10 @@
name: everydream
dependencies:
- pandas>=1.4.3
- pyarrow>=9.0.0
- aiofiles>=22.1.0
- colorama>=0.4.5
- aiohttp>=3.8.3
- aiohttp>=3.8.3
#- open_clip_torch>=1.26.12
- timm
- fairscale==0.4.4
- transformers==4.19.2

requirements.txt

@@ -2,4 +2,8 @@ pandas>=1.4.3
pyarrow>=9.0.0
aiofiles>=22.1.0
colorama>=0.4.5
aiohttp>=3.8.3
aiohttp>=3.8.3
#open_clip_torch>=1.26.12
timm
fairscale==0.4.4
transformers==4.19.2

scripts/auto_caption.py (new file, 179 lines)

@@ -0,0 +1,179 @@
import argparse
import glob
import os
from PIL import Image
import sys
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
import torch
import aiohttp
import asyncio
SIZE = 384
def get_parser(**parser_kwargs):
    parser = argparse.ArgumentParser(**parser_kwargs)
    parser.add_argument(
        "--img_dir",
        type=str,
        nargs="?",
        const=True,
        default="input",
        help="directory with images to be captioned",
    ),
    parser.add_argument(
        "--out_dir",
        type=str,
        nargs="?",
        const=True,
        default="output",
        help="directory to put captioned images",
    ),
    parser.add_argument(
        "--format",
        type=str,
        nargs="?",
        const=True,
        default="filename",
        help="'filename', 'mrwho', 'joepenna', 'json', or 'parquet' ('json' and 'parquet' are not implemented yet)",
    ),
    parser.add_argument(
        "--nucleus",
        type=bool,
        nargs="?",
        const=True,
        default=False,
        help="use nucleus sampling instead of beam search",
    ),
    parser.add_argument(
        "--q_factor",
        type=float,
        nargs="?",
        const=True,
        default=0.8,
        help="repetition penalty for beam search, or top_p for nucleus sampling",
    ),
    parser.add_argument(
        "--min_length",
        type=int,
        nargs="?",
        const=True,
        default=24,
        help="minimum caption length in tokens (beam search only)",
    )
    return parser
def load_image(raw_image, device):
    transform = transforms.Compose([
        #transforms.CenterCrop(SIZE),
        transforms.Resize((SIZE, SIZE), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
    ])
    image = transform(raw_image).unsqueeze(0).to(device)
    return image
async def main(opt):
    print("starting")
    import models.blip

    sample = False
    if opt.nucleus:
        sample = True

    input_dir = os.path.join(os.getcwd(), opt.img_dir)
    print("input_dir: ", input_dir)

    config_path = os.path.join(os.getcwd(), "scripts/BLIP/configs/med_config.json")

    model_cache_path = ".cache/model_base_caption_capfilt_large.pth"
    model_path = os.path.join(os.getcwd(), model_cache_path)

    if not os.path.exists(model_path):
        print(f"Downloading model to {model_path}... please wait")
        blip_model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base_caption_capfilt_large.pth'
        async with aiohttp.ClientSession() as session:
            async with session.get(blip_model_url) as res:
                result = await res.read()
                with open(model_path, 'wb') as f:
                    f.write(result)
        print(f"Model cached to: {model_path}")
    else:
        print(f"Model already cached to: {model_path}")

    blip_decoder = models.blip.blip_decoder(pretrained=model_path, image_size=384, vit='base', med_config=config_path)
    blip_decoder.eval()

    print("loading model to cuda")
    blip_decoder = blip_decoder.to(torch.device("cuda"))

    ext = ('.jpg', '.jpeg', '.png', '.webp', '.tif', '.tga', '.tiff', '.bmp', '.gif')

    i = 0
    for idx, img_file_name in enumerate(glob.iglob(os.path.join(opt.img_dir, "*.*"))):
        if img_file_name.endswith(ext):
            caption = None
            file_ext = os.path.splitext(img_file_name)[1]
            if (file_ext in ext):
                with open(img_file_name, "rb") as input_file:
                    print("working image: ", img_file_name)
                    image = Image.open(input_file)
                    image = load_image(image, device=torch.device("cuda"))

                    if opt.nucleus:
                        captions = blip_decoder.generate(image, sample=True, top_p=opt.q_factor)
                    else:
                        captions = blip_decoder.generate(image, sample=sample, num_beams=16, min_length=opt.min_length, \
                            max_length=48, repetition_penalty=opt.q_factor)

                    caption = captions[0]

                    input_file.seek(0)
                    data = input_file.read()
                    input_file.close()

                    if opt.format in ["mrwho","joepenna"]:
                        prefix = f"{i:05}@"
                        i += 1
                        caption = prefix+caption

                    out_file = os.path.join(opt.out_dir, f"{caption}{file_ext}")
                    print(" out_file:", out_file)
                    print()

                    if opt.format in ["filename","mrwho"]:
                        #out_file = os.path.join(out_file)
                        with open(out_file, "wb") as out_file:
                            out_file.write(data)
                    elif opt.format == "json":
                        raise NotImplementedError
                    elif opt.format == "parquet":
                        raise NotImplementedError
def isWindows():
    return sys.platform.startswith("win")

if __name__ == "__main__":
    print("starting")
    parser = get_parser()
    opt = parser.parse_args()

    if opt.format not in ["filename", "json", "mrwho", "joepenna", "parquet"]:
        raise ValueError("format must be 'filename', 'mrwho', 'joepenna', 'json', or 'parquet'")

    if (isWindows()):
        print("Windows detected, using asyncio.WindowsSelectorEventLoopPolicy")
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    else:
        print("Unix detected, using default asyncio event loop policy")

    blip_path = os.path.join(os.getcwd(), "scripts/BLIP")
    sys.path.append(blip_path)

    asyncio.run(main(opt))

scripts/download_laion.py

@@ -1,7 +1,5 @@
import sys
import os
from types import coroutine
from unittest.util import _MAX_LENGTH
import pandas as pd
import pyarrow as pa
import argparse