Large Language Model Text Generation Inference

Text Generation Inference

[Figure: architecture diagram]

A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power the LLM api-inference widgets.

Features

  • Token streaming using Server-Sent Events (SSE)
  • Dynamic batching of incoming requests for increased total throughput
  • Quantization with bitsandbytes
  • Safetensors weight loading
  • 45ms per token generation for BLOOM with 8xA100 80GB
  • Logits warpers (temperature scaling, top-k, repetition penalty, ...)
  • Stop sequences
  • Log probabilities
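
To illustrate what the logits warpers listed above do, here is a minimal pure-Python sketch of temperature scaling followed by top-k filtering. It is an illustration of the general technique only, not the server's implementation; the function names are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Divide every logit by the temperature (values below 1 sharpen the distribution)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]


def apply_top_k(logits, k):
    """Keep the k largest logits and mask the rest with -inf.

    Ties at the threshold are all kept, so more than k entries can survive.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Convert logits to probabilities; masked (-inf) entries get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
warped = apply_top_k(apply_temperature(logits, 0.5), k=2)
probs = softmax(warped)
```

After warping, only the two highest-scoring tokens keep a non-zero probability, and the lower temperature widens the gap between them.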

Officially supported models

Other models are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Get started

Docker

The easiest way of getting started is using the official Docker container:

model=bigscience/bloom-560m
num_shard=2
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard

You can then query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
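
The same two requests can be sketched in Python using only the standard library. The payload mirrors the curl commands above. The helper names are hypothetical, and the assumption that each streamed `data:` line carries a JSON document is not stated in this README, so check it against the /docs route. Multi-line SSE events are not reassembled here.

```python
import json
import urllib.request


def build_request(base_url, route, prompt, max_new_tokens=9):
    """Build a POST request equivalent to the curl commands above."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}{route}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, max_new_tokens=9):
    """Call /generate and return the decoded JSON body without assuming its shape."""
    request = build_request(base_url, "/generate", prompt, max_new_tokens)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_sse_data(lines):
    """Yield the payload of each `data:` line from an iterable of SSE text lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            payload = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            yield payload[1:] if payload.startswith(" ") else payload


def generate_stream(base_url, prompt, max_new_tokens=9):
    """Call /generate_stream and yield each event payload, assumed to be JSON."""
    request = build_request(base_url, "/generate_stream", prompt, max_new_tokens)
    with urllib.request.urlopen(request) as response:
        text_lines = (raw.decode("utf-8") for raw in response)
        for data in iter_sse_data(text_lines):
            yield json.loads(data)
```

With the container from the Docker command running, `generate("http://127.0.0.1:8080", "Testing API")` sends the same request as the first curl command, and iterating over `generate_stream(...)` with the same arguments yields events as they arrive.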

Note: To use GPUs, you need to install the NVIDIA Container Toolkit.

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch for distributed training and inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism, which dramatically speeds up inference for large language models.

To share data between the different devices of an NCCL group, NCCL might fall back to using host memory if peer-to-peer communication over NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the above command.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:

- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi

and mounting it to /dev/shm.

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.
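
To confirm from inside the container that the shared-memory mount has the expected size, a quick check along these lines can help. This is a sketch with hypothetical helper names; it assumes the mount point is /dev/shm, as described above, and only inspects the filesystem size.

```python
import shutil

GIB = 1024 ** 3


def shm_total_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem mounted at `path`."""
    return shutil.disk_usage(path).total


def has_enough_shm(required_bytes=GIB, path="/dev/shm"):
    """Report whether the shared-memory mount is at least `required_bytes` large."""
    return shm_total_bytes(path) >= required_bytes
```

Calling `has_enough_shm()` inside a container started with --shm-size 1g, or with the Kubernetes volume above mounted at /dev/shm, is expected to return `True`.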

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9 
conda activate text-generation-inference

Then run:

BUILD_EXTENSIONS=True make install # Install the repository and the HF transformers fork with CUDA kernels
make run-bloom-560m

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Debian-based Linux distributions such as Ubuntu, run:

sudo apt-get install libssl-dev gcc -y

CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can disable the kernels by setting the BUILD_EXTENSIONS=False environment variable.

Be aware that the official Docker image has them enabled by default.

Run BLOOM

Download

First you need to download the weights:

make download-bloom

Run

make run-bloom # Requires 8xA100 80GB

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-bloom-quantize # Requires 8xA100 40GB
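
The two hardware requirements can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 176 billion parameters for BLOOM, 2 bytes per parameter for unquantized 16-bit weights, and 1 byte per parameter after 8-bit quantization. It counts weights only and ignores activations and other runtime memory, so treat the numbers as a lower bound.

```python
BLOOM_PARAMETERS = 176e9  # approximate parameter count (assumption)


def weight_gigabytes(parameters, bytes_per_parameter):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


def fits(parameters, bytes_per_parameter, num_gpus, gigabytes_per_gpu):
    """Check whether the weights fit in the combined memory of all GPUs."""
    needed = weight_gigabytes(parameters, bytes_per_parameter)
    return needed <= num_gpus * gigabytes_per_gpu


half_precision_gb = weight_gigabytes(BLOOM_PARAMETERS, 2)  # 16-bit weights
quantized_gb = weight_gigabytes(BLOOM_PARAMETERS, 1)       # 8-bit weights

fits_8x80_half = fits(BLOOM_PARAMETERS, 2, 8, 80)
fits_8x40_half = fits(BLOOM_PARAMETERS, 2, 8, 40)
fits_8x40_quantized = fits(BLOOM_PARAMETERS, 1, 8, 40)
```

Under these assumptions the 16-bit weights need about 352 GB, which fits in 8 x 80 GB but not in 8 x 40 GB, while the 8-bit weights need about 176 GB and fit in 8 x 40 GB.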

Develop

make server-dev
make router-dev

Testing

make python-tests
make integration-tests