# Supported Models and Hardware

## Supported Models

The list of optimized models is below.

If the above list lacks the model you would like to serve, depending on the model's pipeline type, you can try to initialize and serve the model on a best-effort basis as shown below:

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

# for models with a text-generation (causal LM) pipeline
AutoModelForCausalLM.from_pretrained("<model>", device_map="auto")

# or, for models with a text2text-generation (seq2seq) pipeline
AutoModelForSeq2SeqLM.from_pretrained("<model>", device_map="auto")
```

For the optimized models above, TGI uses custom CUDA kernels for faster inference. You can disable them by adding the `--disable-custom-kernels` flag at the end of the `docker run` command.
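For example, a minimal sketch of such a command, assuming the standard TGI Docker image; the volume, port, image tag, and model id below are illustrative placeholders:

```shell
# Illustrative only: adjust the volume, port, image tag, and model id to your setup.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id <model> \
    --disable-custom-kernels
```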

## Supported Hardware

TGI optimized models are supported on NVIDIA A100, A10G and T4 GPUs with CUDA 11.8+. Note that you have to install NVIDIA Container Toolkit to use it. For other hardware, continuous batching will still apply, but some operations (e.g. flash attention, paged attention) will not be executed, which may degrade performance.
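As a quick sanity check that the NVIDIA Container Toolkit is set up, you can run `nvidia-smi` inside a CUDA base container; the image tag below is an assumption, pick one that matches your driver:

```shell
# Should print the GPU table if the NVIDIA Container Toolkit is installed correctly.
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```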

TGI is also supported on the following AI hardware accelerators:

- Habana first-gen Gaudi and Gaudi2: check out here how to serve models with TGI on Gaudi and Gaudi2 with Optimum Habana