CogVLM ([code](https://github.com/THUDM/CogVLM)) ([model](https://huggingface.co/THUDM/cogvlm-chat-hf)) is, so far (Q1 2024), the best model for automatically generating captions.
The model uses about 13.5GB of VRAM with 4-bit inference at the default setting of 1 beam, and up to 4 or 5 beams is possible on a 24GB GPU, so it is very capable on consumer hardware. It is slow, roughly 6-10+ seconds per image on an RTX 3090, but the quality is worth it over other models.
It is capable of naming and identifying things with proper nouns and has a large vocabulary. It can also readily read text even for hard to read fonts, from oblique angles, or from curved surfaces.
<a href="https://colab.research.google.com/github/nawnie/EveryDream2trainer/blob/main/CaptionCog.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
Both the [Vicuna-based](https://huggingface.co/THUDM/cogvlm-chat-hf) and [Llama3-based](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B) models are supported.
Choose between them with one of these two CLI args:
`--model THUDM/cogvlm-chat-hf`
`--model THUDM/cogvlm2-llama3-chat-19B`
The script uses the Vicuna-based model (the first option) by default if no `--model` arg is specified.
When using Llava, the script will perform some clean-up operations to remove some less-than-useful language from the caption, because the bad_words part of the Hugging Face Transformers API is not supported by Llava.
Run `python caption_cog.py --help` to get a list of options.
You can get started just by providing the root path to where all your images are located. The script will create a .txt sidecar file for each image in the same directory as the image, and it runs recursively through subdirectories. The default prompt `Write a description.` is used when no prompt is provided.
Basic usage for prompt:
`--prompt "Write a description that includes [...] "`
I've found that the longer the prompt, the less effective it can be, but it's worth experimenting with this or tailoring it to your data if feasible, to tease out specific details you want in your captions. See [Prompt modification plugins](#prompt-modifcation-plugins) for more capability.
Some prompt ideas:
`Write a concise, accurate, blunt, and detailed description. Avoid euphemisms, vague wording, or ambiguous expressions. Do not exceed 21 words.`
If you know the images are all of a single subject/character, you can ask it to be more specific about the subject:
`Write a description. Include pose, outfit, and surroundings. Be concise, accurate, blunt, and detailed. Avoid euphemisms, vague wording, or ambiguous expressions. Do not exceed 26 words.`
You can add this somewhere in the prompt to get it to attempt to describe the "style" of the image:
`Include the style or medium of the artwork.`
You can include hints to help the model understand the context. For example, if you have a folder full of photos from Peru, add this as part of your prompt:
`As a hint, this is from Peru. Write a description...`
or
`Write a description of this photo taken in Peru.`
Christoph Schuhmann and Peter Bevan's [laion-pop](https://huggingface.co/datasets/laion/laion-pop) dataset has an example of a very long, detailed prompt for general purpose Cog captioning in the readme. They are effectively using `starts_with` and `remove_starts_with` as well, which you can use similarly here (see below).
## Common options
`--starts_with "A photograph of"` will add the given text to the start of the caption, and the model continues the caption from there.
There are two circumstances where this is extremely useful. If you are captioning images that are all of the same subject, you can provide the subject's proper name and force it to be included. Such as `--starts_with "A photograph of John Smith"`. The caption will continue from there.
Another circumstance is to provide a starting phrase such as "An image showcasing" or "An image of", and follow up by using the `--remove_starts_with` option to remove the starting phrase from the caption. Often Cog will add "An image of" on its own, wasting tokens and making the caption less useful. By providing the starting phrase and then removing it with `--remove_starts_with`, you can short circuit the model to start in a more concise manner.
`--remove_starts_with` will remove the `starts_with` text from the start of the output caption. Suggested use is to use this if your starts_with is something like `an image of` but not if your starts_with is a proper noun.
`--append "by Claude Monet."` will add the given text to the end of every caption. It is not fed to the model; it is simply tacked on to the end of the caption with a "dumb code" string append. This can be useful for things like artist or collection names that are fixed across all images.
`--no_overwrite` will skip captioning the image if a corresponding .txt file already exists, useful for resuming.
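To make the interaction between these options concrete, here is a minimal sketch of how the final caption could be assembled from the model's continuation. This is an assumption about the behavior described above, not the script's actual code: the function name `assemble_caption` and the single-space separators are hypothetical.

```python
def assemble_caption(continuation, starts_with="", remove_starts_with=False, append=""):
    """Combine the forced prefix, the model's continuation, and the fixed suffix."""
    # The prefix is what the model was forced to begin with; the continuation follows it.
    if starts_with:
        caption = f"{starts_with} {continuation}".strip()
    else:
        caption = continuation.strip()

    # Optionally drop the prefix again, e.g. for filler like "An image of".
    if remove_starts_with and starts_with and caption.startswith(starts_with):
        caption = caption[len(starts_with):].strip()

    # The appended text never reaches the model; it is plain string concatenation.
    # Assumption: a single space is used as the separator.
    if append:
        caption = f"{caption} {append}".strip()
    return caption


print(assemble_caption("a man standing on a beach.",
                       starts_with="An image of", remove_starts_with=True))
print(assemble_caption("smiling at the camera.",
                       starts_with="A photograph of John Smith"))
```

The first call keeps only the continuation, while the second keeps the proper name at the front of the caption, which is why `--remove_starts_with` is not suggested when the prefix is a proper noun.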
## Prompt modifcation plugins
The script has the ability to execute code to alter the prompt before it is sent to the model. This is an abstract capability that allows users to write their own plugins that execute Python code, opening up any capability you want to program.
Enable a plugin with `--prompt_plugin "plugin_key"` such as `--prompt_plugin "from_leaf_directory"`
Here are the working plugins that come with the script:
* `from_leaf_directory` Adds "hint: folder_name" to the front of your prompt. The leaf directory (immediate directory of image, not roots) of each image is used. Let's assume the `--prompt` is simply set to "Write a description." and go through an example.
Ex. if your data is structured as:
```
/mnt/mydata/training_data/Peru/001.jpg
/mnt/mydata/training_data/Argentina/002.jpg
```
Then 001.jpg will have its prompt adjusted like so:
```
hint: Peru
Write a description.
```
and 002.jpg will have the prompt adjusted like so:
```
hint: Argentina
Write a description.
```
This is very useful if you can organize your data into folders that are meaningful to the captioning task, either manually, or with a classifier.
* `title_and_tags_from_metadata_json` Adds the title and tags from a metadata.json file in the same folder as the image to the prompt. This is useful if you have a metadata.json file in each folder with the images that applies to all the images in that folder. The metadata.json file should look like this:
* `title_and_tags_from_image_json` Same as above but looks for a file ending in `.json` with the same basename and in the same directory as the image (ex. `/myfolder/001.png`, `/myfolder/001.json`), enabling *per-image* metadata instead of a per-folder metadata file.
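The `from_leaf_directory` behavior in the list above boils down to a small path operation. Here is a minimal sketch, assuming the hint is joined to the prompt with a newline as in the example; the function name `prompt_with_leaf_hint` is hypothetical and this is not the plugin's actual code.

```python
import os


def prompt_with_leaf_hint(image_path, prompt):
    """Prefix the prompt with the name of the folder that directly contains the image."""
    leaf = os.path.basename(os.path.dirname(image_path))
    return f"hint: {leaf}\n{prompt}"


print(prompt_with_leaf_hint("/mnt/mydata/training_data/Peru/001.jpg",
                            "Write a description."))
```

Only the immediate parent folder is used, so `training_data` and anything above it never appears in the hint.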
The plugins are all in `/plugins/caption_plugins.py` and are easy to modify or add to. The plugins are executed in the order they are provided on the command line. Inherit from the `PromptIdentityPlugin` class and pass a key for the arg and your function, like `super().__init__(key="my_cool_plugin", fn=your_fn)`. It should be obvious from there for anyone familiar with Python.
ChatGPT should be capable of writing these if you paste in the PromptIdentityPlugin class code and describe what you want it to do.
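For orientation, here is a rough sketch of what a custom plugin might look like. The `PromptIdentityPlugin` defined here is a simplified stand-in written for this example so the snippet runs on its own; the real class in `caption_plugins.py` may have a different constructor, call signature, and argument names. `FilenameHintPlugin`, the `args` dictionary, and its `image_path` and `prompt` keys are all hypothetical.

```python
import os


class PromptIdentityPlugin:
    """Stand-in base class (assumption): returns the prompt unchanged by default."""

    def __init__(self, key="identity", fn=None):
        self.key = key  # the value a user would pass to --prompt_plugin
        self.fn = fn or (lambda args: args["prompt"])

    def __call__(self, args):
        return self.fn(args)


class FilenameHintPlugin(PromptIdentityPlugin):
    """Hypothetical plugin: adds the image's file name (without extension) as a hint."""

    def __init__(self):
        super().__init__(key="filename_hint", fn=self._add_filename_hint)

    def _add_filename_hint(self, args):
        name = os.path.splitext(os.path.basename(args["image_path"]))[0]
        return f"hint: {name}\n{args['prompt']}"


plugin = FilenameHintPlugin()
print(plugin({"image_path": "/myfolder/red_barn.jpg", "prompt": "Write a description."}))
```

Check the real base class before copying this pattern, since the way arguments are passed to `fn` is the part most likely to differ.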
### General Sampling args (advanced users)
It's worth reading through Hugging Face's [tips](https://huggingface.co/docs/transformers/generation_strategies) and [blog post](https://huggingface.co/blog/how-to-generate) as a start for tweaking sampling arguments. The [technical documentation](https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/text_generation) for the Transformers pipeline will also help explain the parameters. The type of search (beam, greedy, probabilistic, etc.) is set automatically based on your options. Default is greedy search (1 beam, no sampling args set).
I would recommend not setting any of these and leaving the default values until you have time to read all of the above.
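The automatic choice of search type can be pictured with a small sketch. It mirrors the decoding modes named in the Hugging Face generation docs (greedy, beam search, multinomial sampling, beam-search multinomial sampling), but the function `decoding_mode` itself is hypothetical and only covers the sampling args listed in this section; it is not the script's actual logic.

```python
def decoding_mode(num_beams=1, temperature=None, top_k=None):
    """Name the decoding strategy implied by the given options."""
    # Assumption: setting any sampling arg switches sampling on.
    sampling = temperature is not None or top_k is not None
    if sampling and num_beams > 1:
        return "beam-search multinomial sampling"
    if sampling:
        return "multinomial sampling"
    if num_beams > 1:
        return "beam search"
    return "greedy search"


print(decoding_mode())                       # no options set
print(decoding_mode(num_beams=3))            # beams only
print(decoding_mode(temperature=0.8))        # sampling only
```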
`--num_beams 1` more [beams](https://en.wikipedia.org/wiki/Beam_search) provide extra "opinions" on the next token to choose. Default is 1, but increasing this slightly may improve quality at the cost of significantly higher VRAM and slower processing. Setting this to 2 or higher enables beam search.
`--repetition_penalty 1.0` penalizes repeating tokens/words, can adjust up if you see repeated terms. 1.0 does nothing.
`--length_penalty 1.0` penalizes longer captions if <0.0 or rewards longer captions if >0.0. Adjusting down may produce somewhat abruptly ending output.
`--bad_words "foo,bar"` Attempts to prevent the model from using these words in the output caption. Comma-delimited. Very useful, consider trying `"depicts,poses,posing,showcases,appears,suggests"` to get more concise phrasing in captions. This is not a guarantee, due to [different tokenizations](https://github.com/huggingface/transformers/issues/17504) being possible for a given bad_word.
`--no_repeat_ngram_size 3` prevents the same n-gram (sequence of size n-tokens) from being repeated in the output. Default is 0, which means no n-gram is prevented from repeating. Setting this to 2 or 3 can help prevent the model from repeating itself.
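To see what the n-gram rule means in practice, here is a minimal sketch that works on a list of word-like tokens. It illustrates the idea only: the real check happens inside the Transformers generation code on token ids, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()  # 0 disables the rule; too-short outputs cannot repeat an n-gram yet
    # The last n-1 tokens are the start of whatever n-gram comes next.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If this earlier n-gram starts with the same n-1 tokens, ban its final token.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


caption_so_far = ["a", "cat", "sat", "on", "a", "cat"]
print(banned_next_tokens(caption_so_far, 3))
```

With a size of 3, the output already contains "a cat sat", so after the second "a cat" the token "sat" is blocked and the model has to pick something else.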
The following args control and enable multinomial sampling. Setting at least one will turn multinomial sampling on and set the other sampling args to default values if not set.
`--temperature 1.0` relates to randomness used for next token chosen.
`--top_k 50` the number of highest-probability vocabulary tokens kept as candidates for the next token; everything else is filtered out.
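As a rough illustration of how these two args shape the choice of the next token, here is a minimal pure-Python sketch using a toy vocabulary. The function name `top_k_temperature_probs` and the example logits are made up for illustration; the real filtering is done inside Transformers on the model's full vocabulary.

```python
import math


def top_k_temperature_probs(logits, temperature=1.0, top_k=50):
    """Scale logits by temperature, keep the top_k tokens, and renormalize to probabilities."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    kept = sorted(scaled, key=scaled.get, reverse=True)[:top_k]
    peak = max(scaled[tok] for tok in kept)  # subtract the max for numerical stability
    exps = {tok: math.exp(scaled[tok] - peak) for tok in kept}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


toy_logits = {"cat": 2.0, "dog": 1.0, "car": -1.0}  # made-up scores
print(top_k_temperature_probs(toy_logits, temperature=1.0, top_k=2))
print(top_k_temperature_probs(toy_logits, temperature=0.5, top_k=2))
```

Lowering the temperature concentrates probability on the top token, raising it flattens the distribution, and `top_k` removes unlikely tokens from consideration entirely.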