Using TGI with AMD GPUs
TGI is supported and tested on AMD Instinct MI210, MI250 and MI300 GPUs. The support may be extended in the future. The recommended usage is through Docker. Make sure to check the AMD documentation on how to use Docker with AMD GPUs.
On a server powered by AMD GPUs, TGI can be launched with the following command:
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--device=/dev/kfd --device=/dev/dri --group-add video \
--ipc=host --shm-size 256g --net host -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:2.3.0-rocm \
--model-id $model
The launched TGI server can then be queried from clients; make sure to check out the Consuming TGI guide.
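As a minimal sketch, the server can be queried over HTTP using TGI's /generate route. This assumes the container launched above is reachable on localhost port 80 (the default with --net host); the prompt and parameters are illustrative:

```python
import json
import urllib.request

# Assumed address: with --net host, TGI listens on port 80 of the host by default.
url = "http://localhost:80/generate"
payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 64},
}
body = json.dumps(payload).encode("utf-8")

# Build the request; actually sending it requires the TGI server launched above.
req = urllib.request.Request(
    url, data=body, headers={"Content-Type": "application/json"}
)
# response = urllib.request.urlopen(req)  # uncomment with a live server
# print(json.loads(response.read())["generated_text"])
print(body.decode("utf-8"))
```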
TunableOp
TGI's docker image for AMD GPUs integrates PyTorch's TunableOp, which performs an additional warmup to select the best-performing matrix multiplication (GEMM) kernel from rocBLAS or hipBLASLt.
Experimentally, on MI300X, we noticed a 6-8% latency improvement when using TunableOp on top of ROCm 6.1 and PyTorch 2.3.
TunableOp is enabled by default; the warmup may take 1-2 minutes. To disable TunableOp, pass --env PYTORCH_TUNABLEOP_ENABLED="0"
when launching TGI's docker container.
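For example, the launch command from above with TunableOp disabled (same flags as before, only the extra --env is added):

```shell
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 256g --net host -v $volume:/data \
    --env PYTORCH_TUNABLEOP_ENABLED="0" \
    ghcr.io/huggingface/text-generation-inference:2.3.0-rocm \
    --model-id $model
```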
Flash attention implementation
Two implementations of Flash Attention are available for ROCm, the first is ROCm/flash-attention based on a Composable Kernel (CK) implementation, and the second is a Triton implementation.
By default, the Composable Kernel implementation is used. The Triton implementation has slightly lower latency on MI250 and MI300, but requires a warmup, which can be prohibitive as it needs to be redone for each new prompt length. If needed, the FA Triton implementation can be enabled with --env ROCM_USE_FLASH_ATTN_V2_TRITON="true"
when launching TGI's docker container.
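For example, the launch command from above with the Triton Flash Attention path selected (same flags as before; this assumes a truthy value for the variable enables the Triton backend):

```shell
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 256g --net host -v $volume:/data \
    --env ROCM_USE_FLASH_ATTN_V2_TRITON="true" \
    ghcr.io/huggingface/text-generation-inference:2.3.0-rocm \
    --model-id $model
```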
Custom PagedAttention
For better performance on ROCm, a custom Paged Attention kernel is available and is enabled by default. To disable it and fall back to the PagedAttention v2 kernel, set the environment variable ROCM_USE_CUSTOM_PAGED_ATTN=0.
The custom kernel supports bf16 and fp16 data types, block size of 16, head size of 128, a maximum context length of 16k, and GQA ratios between 1 and 16. For other configurations, we use the PagedAttention v2 kernel.
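The support envelope above can be summarized with a small helper. This is illustrative only, not TGI's actual dispatch code; the function name is hypothetical:

```python
def uses_custom_paged_attn(dtype: str, block_size: int, head_size: int,
                           max_ctx: int, gqa_ratio: int) -> bool:
    """Return True if the documented ROCm custom kernel constraints are met."""
    return (
        dtype in ("bf16", "fp16")   # supported data types
        and block_size == 16        # required block size
        and head_size == 128        # required head size
        and max_ctx <= 16384        # maximum context length of 16k
        and 1 <= gqa_ratio <= 16    # supported GQA ratios
    )

# Outside this envelope, TGI falls back to the PagedAttention v2 kernel.
```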
Unsupported features
The following features are currently not supported in the ROCm version of TGI; support may be extended in the future:
- Loading AWQ checkpoints.
- Kernel for sliding window attention (Mistral).