# Preparing the Model

Text Generation Inference improves the model in several aspects.

## Quantization

TGI supports bits-and-bytes and GPT-Q quantization. To speed up inference with quantization, set the `--quantize` flag to `bitsandbytes` or `gptq`, depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the GPTQ-quantized models on the Hub. To learn more about quantization, refer to the [quantization guide](./conceptual/quantization.md).
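
For example, a launch command with quantization enabled might look like the following; the model IDs are illustrative placeholders, so substitute the model you actually want to serve.

```bash
# Enable bitsandbytes quantization (model ID is illustrative)
text-generation-launcher --model-id tiiuae/falcon-7b-instruct --quantize bitsandbytes

# Serve a GPTQ-quantized model from the Hub (example repository)
text-generation-launcher --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq
```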

## RoPE Scaling

RoPE scaling can be used to increase the sequence length of the model at inference time without fine-tuning it. To enable RoPE scaling, pass the `--rope-scaling`, `--max-input-length`, and `--rope-factor` flags when running through the CLI. `--rope-scaling` can take the values `linear` or `dynamic`. If your model is not fine-tuned to a longer sequence length, use `dynamic`. `--rope-factor` is the ratio between the intended maximum sequence length and the model's original maximum sequence length. Make sure to pass `--max-input-length` to provide the maximum input length for extension.

We recommend using dynamic RoPE scaling.
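
As a sketch, assuming a model with an original maximum sequence length of 4,096 tokens that you want to extend to 8,192 tokens, the flags could be combined like this (the model ID and values are illustrative):

```bash
# Extend the context window with dynamic RoPE scaling
# rope-factor 2.0 = intended max length (8192) / original max length (4096)
text-generation-launcher \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --rope-scaling dynamic \
    --rope-factor 2.0 \
    --max-input-length 8192
```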

## Safetensors

[Safetensors](https://github.com/huggingface/safetensors) is a fast and safe persistence format for deep learning models, and is required for tensor parallelism. TGI supports `safetensors` model loading under the hood. By default, given a repository with both safetensors and PyTorch weights, TGI will always load the safetensors weights. If there are no safetensors weights, TGI will convert the PyTorch weights to safetensors format.