
# Captioning tools

## CogVLM

CogVLM is, so far, the best model for generating synthetic captions. The Cog script offers additional features, so read the CogVLM README for more information.

## Kosmos-2

Microsoft's Kosmos-2 is significantly lighter weight than Cog, using under 5 GB of VRAM and generating captions in under a second on an RTX 3090.

It has the capability to output grounding bounding boxes.

Run `python caption_kosmos2.py --help` to get a list of options.

## Kosmos-2 grounding

Kosmos-2 can generate bounding boxes for the "grounding" of the caption. This is useful for locating specific objects in the image in 2D space, which can be helpful in later pipelines.

It's worth reading the Kosmos-2 documentation to understand the grounding output.

`--save_entities` outputs a `.ent` file with bounding box information. The entities identified depend on the caption that is produced.

`--phrase_mode` modifies how the model is called, wrapping phrases in `<phrase>` tags. It also interprets your prompt as a CSV, wrapping each item in a phrase tag; you might use it with `--prompt "dog,cat,tree"`, for instance. This does not guarantee your phrases will be found and written to the grounding output file.
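As an illustrative sketch only (the exact formatting inside `caption_kosmos2.py` may differ), wrapping each CSV item in phrase tags could look like this:

```python
def wrap_phrases(prompt: str) -> str:
    """Wrap each comma-separated item in <phrase> tags, mirroring how
    --phrase_mode interprets the prompt. Illustrative sketch, not the
    script's actual implementation."""
    return "".join(f"<phrase>{item.strip()}</phrase>" for item in prompt.split(","))

print(wrap_phrases("dog,cat,tree"))
# <phrase>dog</phrase><phrase>cat</phrase><phrase>tree</phrase>
```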

`--save_entities_only` skips writing the caption to the `.txt` file entirely. This is recommended with `--phrase_mode`. Using this option forces `--save_entities`.
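Kosmos-2 reports grounding boxes in coordinates normalized to [0, 1], so a downstream pipeline typically scales them back to pixel space. A minimal sketch of that step; the entity tuple layout here follows the Hugging Face Kosmos-2 post-processing output and is an assumption about what the `.ent` file contains, so verify it against your own output:

```python
def to_pixel_boxes(entities, width, height):
    """Scale normalized (x1, y1, x2, y2) boxes to pixel coordinates.

    `entities` is assumed to be a list of (phrase, (start, end), [boxes])
    tuples with coordinates normalized to [0, 1] — check this against the
    actual .ent files your caption_kosmos2.py run produces.
    """
    result = []
    for phrase, _span, boxes in entities:
        for x1, y1, x2, y2 in boxes:
            result.append((phrase, (round(x1 * width), round(y1 * height),
                                    round(x2 * width), round(y2 * height))))
    return result

entities = [("a dog", (0, 5), [(0.25, 0.10, 0.75, 0.90)])]
print(to_pixel_boxes(entities, 640, 480))
# [('a dog', (160, 48, 480, 432))]
```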

There is a simple UI for viewing the grounding in the scripts folder. Launch it with `python scripts/grounding_ui.py`; it opens a window that lets you select a directory, then displays the images with their bounding boxes.