Large Language Model Text Generation Inference

Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power the LLM api-inference widgets.

Features
Officially supported models
Get started
Run BLOOM
Develop
Testing

Features

Token streaming using Server-Sent Events (SSE)
Dynamic batching of incoming requests for increased total throughput
Quantization with bitsandbytes
Safetensors weight loading
45ms per token generation for BLOOM with 8xA100 80GB
Logits warpers (temperature scaling, top-k, repetition penalty, ...)
Stop sequences
Log probabilities
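
As a rough illustration of what the logits warpers listed above do, the following sketch applies temperature scaling and top-k filtering to a small list of logits before turning them into probabilities. It is plain Python written for illustration, not the server's implementation; the function names apply_temperature, top_k_filter and softmax and the sample logits are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Divide every logit by the temperature (< 1 sharpens, > 1 flattens)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the others with -inf.

    Ties at the threshold are all kept, so more than k values may survive.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def softmax(logits):
    """Convert logits to probabilities; masked (-inf) entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(top_k_filter(apply_temperature(logits, 0.7), k=2))
```

Only the two highest-scoring tokens keep a non-zero probability here, and a temperature below 1 widens the gap between them.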

Officially supported models

Other models are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Get started

Docker

The easiest way of getting started is using the official Docker container:

model=bigscience/bloom-560m
num_shard=2
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard

You can then query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
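
The /generate request above can also be built from Python. This is a minimal sketch using only the standard library, assuming the server is reachable at 127.0.0.1:8080 as in the Docker command; the helper name build_generate_request is hypothetical, and the response is printed as-is because its exact shape is described by the OpenAPI schema rather than assumed here.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("Testing API", 9)

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```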

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"Testing API","parameters":{"max_new_tokens":9}}' \
    -H 'Content-Type: application/json'
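
The /generate_stream route answers with Server-Sent Events, so a client has to split the response into events and read their data fields. The sketch below parses the generic SSE line format from any iterable of text lines. The helper name iter_sse_data is hypothetical, and the JSON payloads in sample_stream are made up for illustration; the real payload shape is given by the OpenAPI schema.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each Server-Sent Event from an iterable of lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One optional space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event that is not terminated by a blank line is dropped on purpose.


# Made-up stream content; real events come from the HTTP response body.
sample_stream = [
    'data:{"token": {"text": "Hello"}}\n',
    "\n",
    ": keep-alive\n",
    'data: {"token": {"text": " world"}}\n',
    "\n",
]
events = [json.loads(payload) for payload in iter_sse_data(sample_stream)]
text = "".join(event["token"]["text"] for event in events)
```

With a live server, the lines would come from the HTTP response, decoded to text one line at a time.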

Note: To use GPUs, you need to install the NVIDIA Container Toolkit.

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of an NCCL group, NCCL might fall back to using the host memory if peer-to-peer communication over NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g to the docker run command above.

If you are running text-generation-inference inside Kubernetes, you can also add Shared Memory to the container by creating a volume with:

- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 1Gi

and mounting it to /dev/shm.
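
To show how the volume definition and the /dev/shm mount fit together, here is a sketch that assembles a minimal Pod manifest as a Python dictionary and serializes it to JSON, which kubectl accepts as well as YAML. The helper name pod_with_shm and the container name are hypothetical; GPU resources and the server arguments are left out, so this is a starting point rather than a complete deployment.

```python
import json


def pod_with_shm(name, image, shm_size="1Gi"):
    """Return a minimal Pod manifest with a memory-backed volume mounted at /dev/shm."""
    volume = {
        "name": "shm",
        "emptyDir": {"medium": "Memory", "sizeLimit": shm_size},
    }
    container = {
        "name": name,
        "image": image,
        # Mount the memory-backed volume where NCCL expects shared memory.
        "volumeMounts": [{"name": "shm", "mountPath": "/dev/shm"}],
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {"containers": [container], "volumes": [volume]},
    }


manifest = pod_with_shm(
    "text-generation-inference",  # hypothetical pod and container name
    "ghcr.io/huggingface/text-generation-inference:latest",
)
manifest_json = json.dumps(manifest, indent=2)
```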

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.
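
To verify from inside the container that the shared memory setting took effect, a check along these lines may help. It is a sketch that reads the size of the filesystem mounted at /dev/shm with the standard library, assuming that 1g corresponds to 1 GiB; the helper names shm_total_bytes and has_enough_shm are hypothetical.

```python
import os
import shutil

ONE_GIB = 1024 ** 3  # assumed to match the size requested with --shm-size 1g


def shm_total_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem that contains path."""
    return shutil.disk_usage(path).total


def has_enough_shm(total_bytes, required_bytes=ONE_GIB):
    """Return True if the shared memory size meets the requirement."""
    return total_bytes >= required_bytes


# /dev/shm does not exist on every platform, so only report when it is present.
if os.path.isdir("/dev/shm"):
    total = shm_total_bytes()
    print(f"/dev/shm: {total / ONE_GIB:.2f} GiB, sufficient: {has_enough_shm(total)}")
```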

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9 
conda activate text-generation-inference

Then run:

BUILD_EXTENSIONS=True make install # Install the repository and the HF/transformers fork with CUDA kernels
make run-bloom-560m

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Debian- or Ubuntu-based Linux machines, run:

sudo apt-get install libssl-dev gcc -y

CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you run into installation or runtime issues, you can disable the kernels by setting the BUILD_EXTENSIONS=False environment variable.

Be aware that the official Docker image has them enabled by default.

Run BLOOM

Download

First you need to download the weights:

make download-bloom

Run

make run-bloom # Requires 8xA100 80GB

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-bloom-quantize # Requires 8xA100 40GB
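
The difference between the two hardware requirements can be checked with a back-of-the-envelope calculation. The sketch below rests on assumptions that are not stated in this README: BLOOM has roughly 176 billion parameters, unquantized weights use 2 bytes per parameter, and quantized weights use 1 byte per parameter. It counts weights only and ignores activations, the key/value cache and any layers left unquantized.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


BLOOM_PARAMS = 176e9  # assumption: roughly 176 billion parameters

half_precision_gb = weight_memory_gb(BLOOM_PARAMS, 2)  # assumed 16-bit weights
int8_gb = weight_memory_gb(BLOOM_PARAMS, 1)  # assumed 8-bit quantized weights

total_8x80_gb = 8 * 80  # 8xA100 80GB
total_8x40_gb = 8 * 40  # 8xA100 40GB

fits_unquantized_on_8x40 = half_precision_gb <= total_8x40_gb
fits_quantized_on_8x40 = int8_gb <= total_8x40_gb
```

Under these assumptions the 16-bit weights need about 352 GB, more than the 320 GB available on 8xA100 40GB, while the 8-bit weights need about 176 GB and leave room for activations.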

Develop

make server-dev
make router-dev

Testing

make python-tests
make integration-tests