Large Language Model Text Generation Inference

bloom deep-learning falcon gpt inference nlp pytorch starcoder transformer

Nicolas Patry 5afc98a7d7 Snapshot update with vllm paged.		2024-07-25 12:17:40 +02:00

Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.

Get Started
Optimized architectures
Run locally
- Run
- Quantization
Develop
Testing

Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more. TGI implements many features, such as:

Simple launcher to serve most popular LLMs
Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total throughput
Optimized transformers code for inference using Flash Attention and Paged Attention on the most popular architectures
Quantization with:
- bitsandbytes
- GPT-Q
- EETQ
- AWQ
Safetensors weight loading
Watermarking with A Watermark for Large Language Models
Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details, see transformers.LogitsProcessor)
Stop sequences
Log probabilities
Speculation: roughly 2x lower latency
Guidance/JSON: specify an output format to speed up inference and make sure the output is valid according to some specs.
Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance
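Several of the features above (logits warping, stop sequences) are controlled per request. As a minimal sketch, the helper below assembles a request body in the same shape as the curl example in this README ("inputs" plus "parameters"). The helper name is hypothetical, and the field names temperature, top_p, top_k, repetition_penalty and stop are assumed to match the REST API; check the /docs route of your server before relying on them.

```python
import json


def build_generate_payload(prompt, max_new_tokens=20, temperature=None,
                           top_p=None, top_k=None, repetition_penalty=None,
                           stop=None):
    """Build a JSON-serializable request body (hypothetical helper)."""
    parameters = {"max_new_tokens": max_new_tokens}
    # Field names below are assumptions about the REST API schema.
    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "stop": stop,
    }
    # Only send the options the caller actually set, so that server
    # defaults apply to everything else.
    parameters.update({k: v for k, v in optional.items() if v is not None})
    return {"inputs": prompt, "parameters": parameters}


payload = build_generate_payload(
    "What is Deep Learning?",
    temperature=0.7,
    top_p=0.95,
    stop=["\n\n"],
)
print(json.dumps(payload, indent=2))
```

Leaving unset options out of the body keeps the request small and avoids overriding server-side defaults by accident.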

Hardware support

Get Started

Docker

For a detailed starting guide, please see the Quick Tour. The easiest way of getting started is using the official Docker container:

model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id $model

You can then make requests like:

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
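The /generate_stream route returns tokens as Server-Sent Events. The sketch below shows one way a client might turn those event lines into text. It assumes each event is a line of the form data:{...} whose JSON carries a "token" object with "text" and "special" fields; that layout is an assumption here, so verify it against the /docs route. The sample lines are hand-written for illustration, not captured server output.

```python
import json


def parse_sse_line(line):
    """Return the decoded JSON event from one SSE line, or None.

    Lines that do not start with "data:" (blank separators, comments)
    carry no payload and are skipped.
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):])


def collect_text(lines):
    """Concatenate the text of every non-special token event."""
    pieces = []
    for line in lines:
        event = parse_sse_line(line)
        if event is None:
            continue
        token = event.get("token", {})
        if not token.get("special", False):
            pieces.append(token.get("text", ""))
    return "".join(pieces)


# Hand-written sample events (assumed shape), separated by a blank line.
sample = [
    'data:{"token":{"id":1,"text":"Deep","special":false}}',
    "",
    'data:{"token":{"id":2,"text":" learning","special":false}}',
    'data:{"token":{"id":3,"text":"</s>","special":true}}',
]
print(collect_text(sample))
```

In a real client, the lines would come from iterating over the HTTP response body instead of a list.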

Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the --gpus all flag and add --disable-custom-kernels. Note that CPU is not the intended platform for this project, so performance may be subpar.

Note: TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the Supported Hardware documentation. To use AMD GPUs, please use docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.1.1-rocm --model-id $model instead of the command above.

To see all options to serve your models (in the code or in the CLI):

text-generation-launcher --help

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using a private or gated model

Set the HF_TOKEN environment variable to configure the token used by text-generation-inference. This gives it access to protected resources such as private or gated models.

For example, if you want to serve the gated Llama 2 model variants:

Go to https://huggingface.co/settings/tokens
Copy your cli READ token
Run export HF_TOKEN=<your cli READ token>

or with Docker:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

To share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer communication over NVLink or PCI is not possible.

To allow the container to use 1G of shared memory and support SHM sharing, we add --shm-size 1g to the above command.

If you are running text-generation-inference inside Kubernetes, you can also add shared memory to the container by creating a volume with:

- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi

and mounting it to /dev/shm.
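The volume definition above covers only half of the setup; the container also needs a matching volumeMounts entry. As a rough sketch, the snippet below builds both pieces as plain Python dictionaries and prints them as JSON, which Kubernetes accepts in place of YAML. The container name is a hypothetical placeholder, and the fragment is only the relevant part of a pod spec, not a complete manifest.

```python
import json


def shm_volume(size_limit="1Gi", name="shm"):
    """Return the pod-level volume and the container-level mount."""
    volume = {
        "name": name,
        "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
    }
    # The mount refers to the volume by name and exposes it at /dev/shm.
    mount = {"name": name, "mountPath": "/dev/shm"}
    return volume, mount


volume, mount = shm_volume()
pod_spec_fragment = {
    "volumes": [volume],
    "containers": [
        {
            "name": "text-generation-inference",  # hypothetical name
            "volumeMounts": [mount],
        }
    ],
}
print(json.dumps(pod_spec_fragment, indent=2))
```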

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address of an OTLP collector with the --otlp-endpoint argument. The default service name can be overridden with the --otlp-service-name argument.
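As an illustration, the sketch below assembles a launcher command line that enables tracing. The helper is hypothetical; only the --model-id, --otlp-endpoint and --otlp-service-name flags come from this README. The collector address and service name are made-up example values, so substitute the ones from your own setup.

```python
import shlex


def launcher_command(model_id, otlp_endpoint=None, otlp_service_name=None):
    """Build the argument list for text-generation-launcher."""
    args = ["text-generation-launcher", "--model-id", model_id]
    if otlp_endpoint:
        args += ["--otlp-endpoint", otlp_endpoint]
    if otlp_service_name:
        args += ["--otlp-service-name", otlp_service_name]
    return args


cmd = launcher_command(
    "mistralai/Mistral-7B-Instruct-v0.2",
    otlp_endpoint="http://localhost:4317",  # hypothetical collector address
    otlp_service_name="tgi-staging",  # hypothetical service name
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Building the command as a list keeps it usable with subprocess.run without going through a shell.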

Architecture

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.11
conda activate text-generation-inference

You may also need to install Protoc.

On Linux:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

On MacOS, using Homebrew:

brew install protobuf

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

Optimized architectures

TGI works out of the box and serves optimized implementations of all modern models. The supported models can be found in this list.

Other architectures are supported on a best-effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Run locally

Run

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.
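To get a feel for how much VRAM quantization can save, the back-of-the-envelope calculation below estimates the memory taken by the weights alone at different precisions. It is a rough sketch: the 7 billion parameter count is a round figure assumed for illustration, and the estimate ignores quantization metadata, the KV cache and activations, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumed round parameter count for a "7B" model (illustrative only).
params = 7_000_000_000
for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit (NF4/FP4)", 4)]:
    print(f"{label:>16}: {weight_memory_gib(params, bits):.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is why 4-bit types can let a model fit on a GPU that could not hold it in float16.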

Develop

make server-dev
make router-dev

Testing

# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests