Large Language Model Text Generation Inference
Nicolas Patry 5ba53d44a1
Fixing eetq dockerfile. (#1081)
# What does this PR do?

Fixes #1079 
24 hours ago


Text Generation Inference


A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.

Features

  • Serve the most popular Large Language Models with a simple launcher
  • Tensor Parallelism for faster inference on multiple GPUs
  • Token streaming using Server-Sent Events (SSE)
  • Continuous batching of incoming requests for increased total throughput
  • Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
  • Quantization with bitsandbytes and GPTQ
  • Safetensors weight loading
  • Watermarking with A Watermark for Large Language Models
  • Logits warpers (temperature scaling, top-p, top-k, repetition penalty; see transformers.LogitsProcessor for details)
  • Stop sequences
  • Log probabilities
  • Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)
  • Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output.
  • Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance.
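
To make the logits-warper feature above concrete, here is a minimal pure-Python sketch of temperature scaling, top-k and top-p (nucleus) filtering. It is an illustration only and not the code TGI or transformers actually runs; the function names `softmax` and `warp_logits` are hypothetical, and repetition penalty is left out.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; -inf logits get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def warp_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Return next-token probabilities after temperature, top-k and top-p warping."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Token indices ordered from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    # Top-k: keep only the k most likely tokens.
    keep = order if top_k is None else order[:top_k]
    if top_p is not None:
        # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
        probs = softmax([scaled[i] for i in keep])
        cumulative, nucleus = 0.0, []
        for i, p in zip(keep, probs):
            nucleus.append(i)
            cumulative += p
            if cumulative >= top_p:
                break
        keep = nucleus
    kept = set(keep)
    # Mask every filtered-out token so it can never be sampled.
    masked = [scaled[i] if i in kept else float("-inf") for i in range(len(scaled))]
    return softmax(masked)


print(warp_logits([2.0, 1.0, 0.0, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```

Sampling from the returned distribution (for example with `random.choices`) would then pick the next token.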

Optimized architectures

Other architectures are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")


AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Get started


The easiest way of getting started is using the official Docker container:

model=<model id> # Hugging Face Hub ID of the model to serve
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data <text-generation-inference image> --model-id $model

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher. To run the Docker container on a machine with no GPUs or CUDA support, remove the --gpus all flag and add --disable-custom-kernels. CPU is not the intended platform for this project, so performance may be subpar.

To see all options to serve your models (in the code or in the cli):

text-generation-launcher --help

You can then query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

or from Python:

pip install text-generation

from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)
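
If you would rather not install the client, the same requests can be sketched with only the Python standard library. The block below is a hedged illustration: the helper names are hypothetical, the base URL assumes the 8080:80 port mapping from the Docker command, and the streaming format (Server-Sent Events lines of the form `data:{...}` whose JSON carries a `token` object with `text` and `special` fields) is an assumption based on the client fields used above.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=20, stream=False):
    """Build (but do not send) a POST request for /generate or /generate_stream."""
    route = "/generate_stream" if stream else "/generate"
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_text_from_sse_line(line):
    """Return the token text carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other SSE fields
    event = json.loads(line[len("data:"):])
    token = event.get("token") or {}
    if token.get("special"):
        return None  # skip special tokens, as the client example does
    return token.get("text")


request = build_generate_request(
    "http://127.0.0.1:8080", "What is Deep Learning?", stream=True
)
print(request.get_method(), request.full_url)
```

To actually stream, pass the request to `urllib.request.urlopen`, iterate over the response line by line, decode each line as UTF-8 and feed it to `token_text_from_sse_line`.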

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at:

Using a private or gated model

Set the HUGGING_FACE_HUB_TOKEN environment variable to configure the token used by text-generation-inference. This gives it access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

  1. Go to
  2. Copy your cli READ token
  3. Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

model=<model id> # Hugging Face Hub ID of the gated model to serve
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data <text-generation-inference image> --model-id $model
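
The Docker invocations in this README differ only in a few flags, so a launch script can assemble them programmatically. The following is a minimal sketch under stated assumptions: `docker_run_args` is a hypothetical helper, and the image and model ID are placeholders that you must supply yourself. It only uses flags that appear in this README (`--gpus all`, `--shm-size`, `-e`, `-p`, `-v`, `--model-id`, `--disable-custom-kernels`).

```python
import os


def docker_run_args(image, model_id, volume, token=None, use_gpus=True, port=8080):
    """Return the argument list for `docker run`, mirroring the commands above."""
    args = ["docker", "run"]
    if use_gpus:
        args += ["--gpus", "all"]
    args += ["--shm-size", "1g"]
    if token:
        # Needed only for private or gated models.
        args += ["-e", f"HUGGING_FACE_HUB_TOKEN={token}"]
    args += ["-p", f"{port}:80", "-v", f"{volume}:/data", image, "--model-id", model_id]
    if not use_gpus:
        # Without GPUs the custom CUDA kernels must be disabled.
        args.append("--disable-custom-kernels")
    return args


command = docker_run_args(
    image="<text-generation-inference image>",  # placeholder, not a real image name
    model_id="<model id>",                      # placeholder
    volume=os.path.join(os.getcwd(), "data"),
    token=os.environ.get("HUGGING_FACE_HUB_TOKEN"),
)
print(len(command), "arguments prepared")
```

The list can be handed to `subprocess.run(command, check=True)` once the placeholders are replaced. Passing a list instead of a shell string avoids quoting problems with paths and tokens.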

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the above command.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:

- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi

and mounting it to /dev/shm.

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.
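
For Kubernetes, the volume and its mount have to agree on the volume name. The sketch below generates both pieces as plain dictionaries and prints them as JSON, which `kubectl` accepts alongside YAML. The helper name `shm_volume` and the container name are hypothetical, and the output is only a fragment of a pod spec, not a complete manifest.

```python
import json


def shm_volume(size_limit="1Gi", name="shm"):
    """Return a memory-backed emptyDir volume and the matching /dev/shm mount."""
    volume = {"name": name, "emptyDir": {"medium": "Memory", "sizeLimit": size_limit}}
    mount = {"name": name, "mountPath": "/dev/shm"}
    return volume, mount


volume, mount = shm_volume()
pod_spec_fragment = {
    "volumes": [volume],
    "containers": [
        # Hypothetical container name; image and args are omitted in this fragment.
        {"name": "text-generation-inference", "volumeMounts": [mount]}
    ],
}
print(json.dumps(pod_spec_fragment, indent=2))
```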

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9
conda activate text-generation-inference

You may also need to install Protoc.

On Linux, with PROTOC_ZIP set to the path of a downloaded protoc release archive:

sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'

On macOS, using Homebrew:

brew install protobuf

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-falcon-7b-instruct

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can remove the kernels by using the DISABLE_CUSTOM_KERNELS=True environment variable.

Be aware that the official Docker image has them enabled by default.

Run Falcon


make run-falcon-7b-instruct


You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-falcon-7b-instruct-quantize

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command line argument to text-generation-launcher.
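
As a rough guide to how much VRAM quantization saves, the weights alone need about (number of parameters × bits per weight ÷ 8) bytes. The sketch below applies that arithmetic to a model with an assumed round 7 billion parameters. It is a lower bound only: it ignores quantization constants, layers that stay unquantized, activations and the attention cache, so real usage will be higher.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


NUM_PARAMETERS = 7_000_000_000  # assumption: a "7B" model, rounded

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit (NF4/FP4)", 4)]:
    print(f"{label:>16}: {weight_memory_gb(NUM_PARAMETERS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint from about 14 GB to about 3.5 GB.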

Develop

make server-dev
make router-dev

Testing

# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests

Other supported hardware

TGI is also supported on the following AI hardware accelerators:

  • Habana first-gen Gaudi and Gaudi2: see how to serve models with TGI on Gaudi and Gaudi2 using Optimum Habana