Large Language Model Text Generation Inference

Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.

Get Started
Optimized architectures
Run locally
- Run
- Quantization
Develop
Testing

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

Simple launcher to serve most popular LLMs
Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total throughput
Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
Quantization with:
- bitsandbytes
- GPTQ
- EETQ
- AWQ
- Marlin
- fp8
Safetensors weight loading
Watermarking with A Watermark for Large Language Models
Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see transformers.LogitsProcessor)
Stop sequences
Log probabilities
Speculation (roughly 2x lower latency)
Guidance/JSON: specify an output format to speed up inference and make sure the output is valid according to a given spec.
Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
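
To make the logits-warping options listed above more concrete, here is a minimal pure-Python sketch of temperature scaling, top-k and top-p (nucleus) filtering applied to a list of raw logits. It is an illustration of the general technique only, not TGI's or transformers' implementation; the function name `warp_logits` and its parameters are hypothetical.

```python
import math


def warp_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return a probability per token after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration; not part of TGI or transformers.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens (0 disables the filter).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k] if top_k > 0 else order

    # Numerically stable softmax over the kept tokens.
    peak = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in keep:  # `keep` is already sorted by descending score
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        mass = sum(probs[i] for i in kept)
        probs = {i: probs[i] / mass for i in kept}

    # Filtered-out tokens get probability zero.
    return [probs.get(i, 0.0) for i in range(len(logits))]
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices(range(len(p)), weights=p)`.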

Hardware support

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:

model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.2.0 --model-id $model

And then you can make requests like

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
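
The same streaming request can be sketched in Python using only the standard library. The payload mirrors the curl call above. The shape of each streamed event (a `data:` line carrying JSON with a `token` object that has a `text` field) is an assumption about the server's SSE output, and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_generate_payload(prompt: str, max_new_tokens: int = 20) -> bytes:
    """Encode the same JSON body as the curl example."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return json.dumps(body).encode("utf-8")


def iter_sse_json(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data fields
        yield json.loads(line[len("data:"):])


def stream_tokens(url: str, prompt: str, max_new_tokens: int = 20) -> Iterator[str]:
    """POST to /generate_stream and yield token text as it arrives."""
    request = urllib.request.Request(
        url,
        data=build_generate_payload(prompt, max_new_tokens),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        for event in iter_sse_json(response):
            # Assumed event shape: {"token": {"text": "..."}, ...}
            yield event.get("token", {}).get("text", "")


# Usage, assuming the container from the command above is running:
# for text in stream_tokens("http://127.0.0.1:8080/generate_stream",
#                           "What is Deep Learning?"):
#     print(text, end="", flush=True)
```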

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the --gpus all flag and add --disable-custom-kernels. Please note that CPU is not the intended platform for this project, so performance might be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, please use docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.2.0-rocm --model-id $model instead of the command above.

To see all the options for serving your models (in the code or in the CLI):

text-generation-launcher --help

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

You can use the HF_TOKEN environment variable to configure the token used by text-generation-inference. This gives you access to protected resources.

For example, if you want to serve the gated Llama 2 model variants:

Go to https://huggingface.co/settings/tokens
Copy your cli READ token
Export HF_TOKEN=<your cli READ token>

or with Docker:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.2.0 --model-id $model
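
If you launch the container from a script, the command above can be assembled as an argument list. The sketch below is one possible way to do it, with a hypothetical helper name. It passes `-e HF_TOKEN` without a value, which makes Docker forward the variable from the calling environment, so the token does not appear on the command line. The image tag is the one used in the Docker section and is only a default here.

```python
import shlex

DEFAULT_IMAGE = "ghcr.io/huggingface/text-generation-inference:2.2.0"


def docker_run_args(model, volume, image=DEFAULT_IMAGE, token_env="HF_TOKEN"):
    """Build the `docker run` argument list for serving a gated model.

    Hypothetical helper: `-e NAME` (no value) forwards NAME from the host
    environment into the container.
    """
    return [
        "docker", "run",
        "--gpus", "all",
        "--shm-size", "1g",
        "-e", token_env,
        "-p", "8080:80",
        "-v", f"{volume}:/data",
        image,
        "--model-id", model,
    ]


# Usage: export HF_TOKEN in the shell first, then for example
#   import subprocess
#   subprocess.run(docker_run_args("meta-llama/Llama-2-7b-chat-hf", "/path/to/data"), check=True)
# shlex.join(...) gives a copy-pasteable form of the same command for logging.
```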

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the command above.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:

- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi

and mounting it to /dev/shm.
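
As a rough illustration of how the volume and its mount fit together, the sketch below builds the relevant fragment of a pod spec as a Python dictionary, which can be dumped to JSON and merged into a manifest. The helper name and the container name are hypothetical; only the volume definition and the /dev/shm mount path come from the text above.

```python
import json


def shm_pod_fragment(container_name="text-generation-inference", size_limit="1Gi"):
    """Return the pod-spec fragment that backs /dev/shm with an in-memory volume."""
    volume = {
        "name": "shm",
        "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
    }
    # The mount refers to the volume by name and exposes it at /dev/shm.
    mount = {"name": "shm", "mountPath": "/dev/shm"}
    return {
        "volumes": [volume],
        "containers": [{"name": container_name, "volumeMounts": [mount]}],
    }


if __name__ == "__main__":
    print(json.dumps(shm_pod_fragment(), indent=2))
```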

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument. The default service name can be overridden with the --otlp-service-name argument.

Architecture

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.11
conda activate text-generation-inference

You may also need to install Protoc.

On Linux:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

On macOS, using Homebrew:

brew install protobuf

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

Optimized architectures

TGI works out of the box to serve optimized implementations of all modern models. They can be found in this list.

Other architectures are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Run locally

Run

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.
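
When the launcher is started from a script, it can help to validate the quantization choice before building the command. The sketch below is a hypothetical helper that only knows the bitsandbytes values discussed in this section. The launcher accepts other values for the other quantization methods, so check text-generation-launcher --help for the full list.

```python
# Values covered by this section only; the launcher itself supports more.
BITSANDBYTES_CHOICES = ("bitsandbytes", "bitsandbytes-nf4", "bitsandbytes-fp4")


def launcher_args(model_id, quantize=None):
    """Build the text-generation-launcher argument list, optionally quantized."""
    args = ["text-generation-launcher", "--model-id", model_id]
    if quantize is not None:
        if quantize not in BITSANDBYTES_CHOICES:
            raise ValueError(
                f"unsupported value {quantize!r}; expected one of {BITSANDBYTES_CHOICES}"
            )
        args += ["--quantize", quantize]
    return args


# Usage:
#   import subprocess
#   subprocess.run(launcher_args("mistralai/Mistral-7B-Instruct-v0.2",
#                                quantize="bitsandbytes-nf4"), check=True)
```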

Develop

make server-dev
make router-dev

Testing

# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests