# Using TGI with AMD GPUs

TGI is supported and tested on AMD Instinct MI210, MI250 and MI300 GPUs. The support may be extended in the future. The recommended usage is through Docker. Make sure to check the AMD documentation on how to use Docker with AMD GPUs.

On a server powered by AMD GPUs, TGI can be launched with the following command:

```bash
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 256g --net host -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.3-rocm \
    --model-id $model
```

The launched TGI server can then be queried from clients; make sure to check out the Consuming TGI guide.
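As a minimal sketch of such a client, the server can be queried over HTTP on its `/generate` endpoint. This example uses only the Python standard library and assumes the server launched above is listening on `localhost` port 80; `build_payload` and `query_tgi` are illustrative helpers, not part of TGI.

```python
import json
from urllib import request


def build_payload(prompt: str, max_new_tokens: int = 64) -> dict:
    """Build the JSON body for TGI's /generate endpoint."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def query_tgi(prompt: str, url: str = "http://localhost:80/generate") -> str:
    """POST a prompt to a running TGI server and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]


# Example (requires a running server):
# print(query_tgi("What is deep learning?"))
```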

## TunableOp

TGI's Docker image for AMD GPUs integrates PyTorch's TunableOp, which performs an additional warmup to select the best-performing matrix multiplication (GEMM) kernel from rocBLAS or hipBLASLt.

Experimentally, on MI300X, we noticed a 6-8% latency improvement when using TunableOp on top of ROCm 6.1 and PyTorch 2.3.

TunableOp is enabled by default, and the warmup may take 1-2 minutes. To disable TunableOp, pass `--env PYTORCH_TUNABLEOP_ENABLED="0"` when launching TGI's Docker container.
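For example, the launch command from the first section with TunableOp disabled (a config fragment; `$volume` and `$model` are as defined above):

```bash
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 256g --net host -v $volume:/data \
    --env PYTORCH_TUNABLEOP_ENABLED="0" \
    ghcr.io/huggingface/text-generation-inference:2.0.3-rocm \
    --model-id $model
```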

## Flash attention implementation

Two implementations of Flash Attention are available for ROCm: the first, ROCm/flash-attention, is based on a Composable Kernel (CK) implementation, and the second is a Triton implementation.

The Triton implementation is used by default, as its performance has experimentally been better. It can be disabled (falling back to the CK implementation) by passing `--env ROCM_USE_FLASH_ATTN_V2_TRITON="0"` when launching TGI's Docker container.
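For example, the launch command from the first section with the CK implementation selected instead of Triton (a config fragment; `$volume` and `$model` are as defined above):

```bash
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 256g --net host -v $volume:/data \
    --env ROCM_USE_FLASH_ATTN_V2_TRITON="0" \
    ghcr.io/huggingface/text-generation-inference:2.0.3-rocm \
    --model-id $model
```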

## Unsupported features

The following features are currently not supported in the ROCm version of TGI; support may be extended in the future:

- Loading AWQ checkpoints
- Kernel for sliding window attention (Mistral)