Compare commits

...

3 Commits

Author SHA1 Message Date
Victor Hall 45ecb11402 update cog doc with colab link 2024-03-24 10:02:38 -04:00
Victor Hall 87bfe652ce caption cog notebook2 2024-03-24 10:00:00 -04:00
Victor Hall 93659f3eb4 caption cog notebook 2024-03-24 09:30:23 -04:00
2 changed files with 48 additions and 17 deletions

CaptionCog.ipynb

@@ -6,14 +6,16 @@
"metadata": {},
"source": [
"# Cog Captioning\n",
"<a href=\"https://colab.research.google.com/github/nawnie/EveryDream2trainer/blob/main/CaptionCog.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"This notebook is an implementation of [CogVLM](https://github.com/THUDM/CogVLM) for image captioning. \n",
"\n",
"This may require HIGH RAM shape on Google Colab, but T4 16gb is enough (even if slow).\n",
"Read [Docs](doc/CAPTION_COG.md) for basic usage guide. \n",
"\n",
"1. Read [Docs](doc/CAPTION_COG.md) for basic usage guide. \n",
"2. Open in [Google Colab](https://colab.research.google.com/github/victorchall/EveryDream2trainer/blob/main/CaptionCog.ipynb) **OR** Runpod/Vast using the EveryDream2trainer docker container/template and open this notebook.\n",
"3. Run the cells below to install dependencies.\n",
"4. Place your images in \"input\" folder or change the data_root to point to a Gdrive folder."
"Open in [Google Colab](https://colab.research.google.com/github/victorchall/EveryDream2trainer/blob/main/CaptionCog.ipynb) **OR** use Runpod/Vast/Whatever using the EveryDream2trainer docker container/template and open this notebook.\n",
"\n",
"### Requirements\n",
"Ampere or newer GPU with 16GB+ VRAM (ex. A100, 3090, 4060 Ti 16GB, etc). \n",
"The 4bit quantization loading requires bfloat16 datatype support, which is not supported on older Turing GPUs like the T4 16GB, which rules out using free tier Google Colab.\n"
]
},
{
@@ -23,10 +25,14 @@
"outputs": [],
"source": [
"# install dependencies\n",
"!pip install huggingface-hub\n",
"!pip install transformers\n",
"!pip install pynvml\n",
"!pip install colorama"
"!pip install huggingface-hub -q\n",
"!pip install transformers -q\n",
"!pip install pynvml -q\n",
"!pip install colorama -q\n",
"!pip install peft -q\n",
"!pip install bitsandbytes -q\n",
"!pip install einops -q\n",
"!pip install xformers -q"
]
},
{
@@ -38,7 +44,8 @@
"# Colab only setup (do NOT run for docker/runpod/vast)\n",
"!git clone https://github.com/victorchall/EveryDream2trainer\n",
"%cd EveryDream2trainer\n",
"%mkdir -p /content/EveryDream2trainer/input"
"%mkdir -p /content/EveryDream2trainer/input\n",
"%cd /content/EveryDream2trainer"
]
},
{
@@ -71,6 +78,17 @@
" tar_ref.extractall(input_folder)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Connect Gdrive (Optional, will popup a warning)\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -78,9 +96,8 @@
"source": [
"## Run captions.\n",
"\n",
"Place your images in \"input\" folder, or you can change the data_root to point to a Gdrive folder.\n",
"\n",
"Run either the 24GB or 16GB model or adjust settings on your own."
"Place your images in \"input\" folder, or you can change the image_dir to point to a Gdrive folder.\n",
"Note: Colab may complain that you are running out of disk space, but it should still work.\n"
]
},
{
@@ -92,7 +109,7 @@
"# 16GB GPU, must not use more than 1 beam\n",
"# 24GB GPU, can use 3 beams\n",
"%cd /content/EveryDream2trainer\n",
"%run caption_cog.py --image_dir \"input\" --num_beams 1 --prompt \"Write a description.\""
"%run caption_cog.py --image_dir \"input\" --num_beams 1 --prompt \"Write a description.\" --no_overwrite"
]
},
{
@@ -103,8 +120,20 @@
"source": [
"# This is a fancier version of above with more options set\n",
"%cd /content/EveryDream2trainer\n",
"%run caption_cog.py --image_dir \"input\" --num_beams 1 --prompt \"Write a description.\" --starts_with \"An image of\" --remove_starts_with --temp 0.9 --top_p 0.9 --top_k 40 --bad_words \"depicts,showcases,appears,suggests\""
"%run caption_cog.py --image_dir \"input\" --num_beams 1 \\\n",
" --prompt \"Write a description.\" \\\n",
" --starts_with \"An image of\" --remove_starts_with \\\n",
" --temp 0.9 --top_p 0.9 --top_k 40 \\\n",
" --bad_words \"depicts,showcases,appears,suggests\" \\\n",
" --no_overwrite "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

doc/CAPTION_COG.md

@@ -1,11 +1,13 @@
# CogVLM captioning
CogVLM [code](https://github.com/THUDM/CogVLM) [model](https://huggingface.co/THUDM/cogvlm-chat-hf) is, so far (Q1 2024), the best model for automatically generating captions.
CogVLM ([code](https://github.com/THUDM/CogVLM)) ([model](https://huggingface.co/THUDM/cogvlm-chat-hf)) is, so far (Q1 2024), the best model for automatically generating captions.
The model uses about 13.5GB of VRAM due to 4bit inference with the default setting of 1 beam, and up to 4 or 5 beams is possible with a 24GB GPU, meaning it is very capable on consumer hardware. It is slow, ~6-10 seconds on an RTX 3090, but the quality is worth it over other models.
The model uses about 13.5GB of VRAM due to 4bit inference with the default setting of 1 beam, and up to 4 or 5 beams is possible with a 24GB GPU, meaning it is very capable on consumer hardware. It is slow, ~6-10+ seconds on an RTX 3090, but the quality is worth it over other models.
It is capable of naming and identifying things with proper nouns and has a large vocabulary. It can also readily read text, even in hard-to-read fonts, from oblique angles, or from curved surfaces.
<a href="https://colab.research.google.com/github/nawnie/EveryDream2trainer/blob/main/CaptionCog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
## Basics
Run `python caption_cog.py --help` to get a list of options.
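As an example, a typical first run mirrors the Colab notebook cells above (all flags shown here come from the notebook; adjust `--image_dir` and the beam count to your setup):

```
python caption_cog.py --image_dir "input" --num_beams 1 --prompt "Write a description." --no_overwrite
```

On a 24GB GPU, `--num_beams 3` is a reasonable starting point, per the notes in the notebook.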