Merge branch 'main' of https://github.com/victorchall/EveryDream2trainer into main
commit d0a979ff9a
@@ -5,13 +5,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Open-flamingo Captioning\n",
"This notebook is an implementation of [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) for image captioning. \n",
"# Cog Captioning\n",
"This notebook is an implementation of [CogVLM](https://github.com/THUDM/CogVLM) for image captioning. \n",
"\n",
"This will require HIGH RAM shape on Google Colab, but T4 16gb is enough to run the 3B model. 9B model requires 24GB GPU or better.\n",
"This may require HIGH RAM shape on Google Colab, but T4 16gb is enough (even if slow).\n",
"\n",
"1. Read [Docs](doc/CAPTION.md) for basic usage guide. \n",
"2. Open in [Google Colab](https://colab.research.google.com/github/victorchall/EveryDream2trainer/blob/main/CaptionFL.ipynb) **OR** Runpod/Vast using the EveryDream2trainer docker container/template and open this notebook.\n",
"1. Read [Docs](doc/CAPTION_COG.md) for basic usage guide. \n",
"2. Open in [Google Colab](https://colab.research.google.com/github/victorchall/EveryDream2trainer/blob/main/CaptionCog.ipynb) **OR** Runpod/Vast using the EveryDream2trainer docker container/template and open this notebook.\n",
"3. Run the cells below to install dependencies.\n",
"4. Place your images in \"input\" folder or change the data_root to point to a Gdrive folder."
]
@@ -23,9 +23,8 @@
"outputs": [],
"source": [
"# install dependencies\n",
"!pip install open-flamingo==2.0.0\n",
"!pip install huggingface-hub==0.15.1\n",
"!pip install transformers==4.30.2\n",
"!pip install huggingface-hub\n",
"!pip install transformers\n",
"!pip install pynvml\n",
"!pip install colorama"
]
@@ -90,9 +89,10 @@
"metadata": {},
"outputs": [],
"source": [
"# 24GB GPU, 9b model\n",
"# 16GB GPU, must not use more than 1 beam\n",
"# 24GB GPU, can use 3 beams\n",
"%cd /content/EveryDream2trainer\n",
"%run caption_fl.py --data_root \"input\" --min_new_tokens 20 --max_new_tokens 30 --num_beams 3 --model \"openflamingo/OpenFlamingo-9B-vitl-mpt7b\""
"%run caption_cog.py --image_dir \"input\" --num_beams 1 --prompt \"Write a description.\""
]
},
{
@@ -101,28 +101,28 @@
"metadata": {},
"outputs": [],
"source": [
"# 16GB GPU, 3b model\n",
"# This is a fancier version of above with more options set\n",
"%cd /content/EveryDream2trainer\n",
"%run caption_fl.py --data_root \"input\" --min_new_tokens 20 --max_new_tokens 30 --num_beams 8 --model \"openflamingo/OpenFlamingo-3B-vitl-mpt1b\""
"%run caption_cog.py --image_dir \"input\" --num_beams 1 --prompt \"Write a description.\" --starts_with \"An image of\" --remove_starts_with --temp 0.9 --top_p 0.9 --top_k 40 --bad_words \"depicts,showcases,appears,suggests\""
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"machine_shape": "hm",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
},
"orig_nbformat": 4,
"colab": {
"provenance": [],
"machine_shape": "hm",
"gpuType": "T4"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 0
}
}
@@ -12,6 +12,9 @@ It has the capability to output grounding bounding boxes.

Run `python caption_kosmos2.py --help` to get a list of options.

You can use `--prompt` to provide a prompt. The official suggested prompts are `An image of` or `Describe this image in detail:`. The latter is the default if you do not set a prompt.
To use Kosmos-2 for VQA (visual question answering), format your prompt like so: `Question: Is there watermark on this image? Answer:`.
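
As a small illustration of the VQA prompt shape described above, the sketch below builds such a string from a free-form question. The helper name `build_vqa_prompt` is hypothetical and is not part of `caption_kosmos2.py`; it only shows the `Question: ... Answer:` layout.

```python
def build_vqa_prompt(question: str) -> str:
    """Format a question in the `Question: ... Answer:` shape.

    Hypothetical helper for illustration only; it is not part of the
    EveryDream2trainer scripts.
    """
    # Trim stray whitespace so the prompt has exactly one space on each side.
    return f"Question: {question.strip()} Answer:"


if __name__ == "__main__":
    # Prints the string you would pass to --prompt.
    print(build_vqa_prompt("Is there watermark on this image?"))
```

The resulting string is what you would pass as the value of `--prompt`.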

### _Kosmos-2 grounding_

Kosmos-2 can generate bounding boxes for the "grounding" of the caption. This identifies specific objects in the image in 2D space, which can be useful in later pipelines.
@@ -20,8 +23,8 @@ It's worth reading the documentation [here](https://huggingface.co/microsoft/kos

`--save_entities` outputs a '.ent' file with bounding box information. The entities identified will be based on what caption is produced.

`--phrase_mode` This modifies how the model is called, wrapping phrases in \<phrase> tags. This also interprets your prompt as a CSV, wrapping each item in a phrase tag. You might use it with `--prompt "dog,cat,tree"` for instance. *This is not a guarantee your phrases will be found and output into the grounding output file.*
`--phrase_mode` This modifies how the model is called, wrapping phrases in \<phrase> tags to identify specific classes. This also interprets your prompt as a CSV, wrapping each item in a phrase tag. You might use it with `--prompt "dog,cat,tree"` for instance. *This is not a guarantee your phrases will be found and output into the grounding output file.* Things like `--phrase_mode --prompt "watermark"` might work as a poor man's watermark detector, but results are mixed, so it's best to test with your data.
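
To make the CSV handling described for `--phrase_mode` concrete, here is a minimal sketch of splitting a comma-separated prompt and wrapping each item in a phrase tag. The function name `wrap_phrases` is hypothetical, and this is not the script's actual code: how `caption_kosmos2.py` joins the tagged phrases or adds any other tokens is not shown here and is left out on purpose.

```python
def wrap_phrases(prompt_csv: str) -> list[str]:
    """Split a comma-separated prompt and wrap each item in <phrase> tags.

    Illustrative assumption only: the real script may tokenize or join
    the phrases differently.
    """
    # Split on commas, trim whitespace, and skip empty items such as "a,,b".
    items = [item.strip() for item in prompt_csv.split(",")]
    return [f"<phrase>{item}</phrase>" for item in items if item]


if __name__ == "__main__":
    for tagged in wrap_phrases("dog,cat,tree"):
        print(tagged)
```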

`--save_entities_only` This will not attempt to write the caption into the .txt file at all. **This is recommended with `--phrase_mode`**. Using this option forces `--save_entities`.
`--save_entities_only` This will not attempt to write the caption into the .txt file at all. **This is recommended with `--phrase_mode` for object detection**. Using this option forces `--save_entities`.

There is a trivial/dumb UI for viewing the grounding in the scripts folder. Launch it with `python scripts/grounding_ui.py` and it will open a window allowing you to select a directory, and it will display the images and bounding boxes.