Preparing the Model

Text Generation Inference (TGI) can optimize how a model is loaded and served in several ways.

Quantization

TGI supports bits-and-bytes, GPT-Q, AWQ, Marlin, EETQ, EXL2, and fp8 quantization. To speed up inference with quantization, set the --quantize flag to bitsandbytes, gptq, awq, marlin, exl2, eetq, or fp8, depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to a model that has already been quantized with GPT-Q. Similarly, when using AWQ quantization, you need to point to a model that has already been quantized with AWQ. For more information about quantization, refer to the quantization guide.
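As a rough illustration of the flag described above, the following sketch builds a launcher command line and rejects values that are not in the list given in the text. The helper name `launcher_args`, the executable name `text-generation-launcher`, the `--model-id` flag, and the model name are assumptions made for the example; check them against your installed version before relying on them.

```python
# Values for the quantize flag, as listed in the paragraph above.
SUPPORTED_QUANTIZE = {"bitsandbytes", "gptq", "awq", "marlin", "exl2", "eetq", "fp8"}


def launcher_args(model_id: str, quantize: str) -> list[str]:
    """Build an argument list for the launcher (hypothetical helper).

    Assumes the executable is named text-generation-launcher and that the
    model is selected with --model-id.
    """
    if quantize not in SUPPORTED_QUANTIZE:
        raise ValueError(f"unsupported quantize value: {quantize!r}")
    return ["text-generation-launcher", "--model-id", model_id, "--quantize", quantize]


# "my-org/my-model-gptq" is a placeholder, not a real repository.
print(" ".join(launcher_args("my-org/my-model-gptq", "gptq")))
```

The list form can be passed to `subprocess.run` directly, which avoids shell quoting issues with model names.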

RoPE Scaling

RoPE scaling can be used to increase the model's sequence length at inference time without necessarily fine-tuning it. To enable RoPE scaling, pass the --rope-scaling, --max-input-length, and --rope-factor flags when running through the CLI. --rope-scaling can take the values linear or dynamic. If your model is not fine-tuned to a longer sequence length, use dynamic. --rope-factor is the ratio between the intended max sequence length and the model's original max sequence length. Use --max-input-length to set the maximum input length allowed for the extended sequence.

We recommend using dynamic RoPE scaling.
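The factor described above is a simple ratio, so it can be computed before launching. The sketch below is one way to derive it and assemble the three flags; the helper names `rope_factor` and `rope_args` and the example lengths (2048, 4096, 4000) are hypothetical and chosen only for illustration.

```python
def rope_factor(target_len: int, original_len: int) -> float:
    """Ratio between the intended and the original max sequence length."""
    if original_len <= 0 or target_len < original_len:
        raise ValueError("target_len must be >= original_len > 0")
    return target_len / original_len


def rope_args(scaling: str, target_len: int, original_len: int,
              max_input_length: int) -> list[str]:
    """Assemble the RoPE-related CLI flags (hypothetical helper)."""
    if scaling not in ("linear", "dynamic"):
        raise ValueError("scaling must be 'linear' or 'dynamic'")
    factor = rope_factor(target_len, original_len)
    return [
        "--rope-scaling", scaling,
        "--rope-factor", str(factor),
        "--max-input-length", str(max_input_length),
    ]


# Example: a model trained on 2048 tokens, extended to 4096 tokens.
print(rope_factor(4096, 2048))  # 2.0
print(rope_args("dynamic", 4096, 2048, max_input_length=4000))
```

The maximum input length is kept as a separate argument here because it is a serving limit you choose, not something derived from the factor.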

Safetensors

Safetensors is a fast and safe persistence format for deep learning models, and is required for tensor parallelism. TGI loads safetensors models under the hood. By default, given a repository with both safetensors and PyTorch weights, TGI will always load the safetensors weights. If the repository contains only PyTorch weights, TGI will convert them to safetensors format.
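The selection rule above can be pictured as a small decision function. This is a sketch of the described behavior, not TGI's actual implementation: the function name `plan_weight_loading` is hypothetical, it assumes PyTorch weights use the `.bin` extension, and it reads the conversion case as a repository that has PyTorch weights but no safetensors files.

```python
def plan_weight_loading(filenames: list[str]) -> tuple[str, list[str]]:
    """Decide whether to load safetensors files or convert PyTorch ones.

    Returns ("load", files) when safetensors files exist, otherwise
    ("convert", files) for PyTorch files that would need conversion.
    """
    safetensors = sorted(f for f in filenames if f.endswith(".safetensors"))
    if safetensors:
        # Safetensors always wins when both formats are present.
        return "load", safetensors
    # Assumption: PyTorch checkpoints are stored as .bin files.
    pytorch = sorted(f for f in filenames if f.endswith(".bin"))
    if pytorch:
        return "convert", pytorch
    raise FileNotFoundError("no safetensors or PyTorch weight files found")


print(plan_weight_loading(["config.json", "model.safetensors", "pytorch_model.bin"]))
print(plan_weight_loading(["config.json", "pytorch_model.bin"]))
```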