Large Language Model Text Generation Inference

bloom deep-learning falcon gpt inference nlp pytorch starcoder transformer

Go to file

Nicolas Patry a0d55358d2 feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671 ) - Current PR is not great because we're side stepping the `Weights.__init__` but Weights shouldn't requires anything related to the config or the model_id as it aims to be a simple Wrapper over multi file loading. - Ideal solution would be to use something like Rust enum ``` enum Quantize{ Bitandbytes(Bitsandbytes), GPTQ(bits: usize, groupsize: usize) ``` And passing that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works.) Helps #601 In next steps we should make sure our quantization script uses that format and make it standard. # What does this PR do? <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil -->		2023-07-25 13:00:27 +02:00
.github	chore: migrate ci region for more availability. (#581 )	2023-07-12 10:01:01 +02:00
assets	feat(benchmark): tui based benchmarking tool (#149 )	2023-03-30 15:26:27 +02:00
benchmark	docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462 )	2023-07-04 18:35:37 +02:00
clients/python	feat(server): only compute prefill logprobs when asked (#406 )	2023-06-02 17:12:30 +02:00
docs	v0.9.3 (#634 )	2023-07-18 18:11:20 +02:00
integration-tests	feat: add cuda memory fraction (#659 )	2023-07-24 11:43:58 +02:00
launcher	feat: add cuda memory fraction (#659 )	2023-07-24 11:43:58 +02:00
load_tests	feat: add nightly load testing (#358 )	2023-05-23 17:42:19 +02:00
proto	feat(server): auto max_batch_total_tokens for flash att models (#630 )	2023-07-19 09:31:25 +02:00
router	feat: add cuda memory fraction (#659 )	2023-07-24 11:43:58 +02:00
server	feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671 )	2023-07-25 13:00:27 +02:00
.dockerignore	chore: add `flash-attention` to docker ignore (#287 )	2023-05-05 17:52:09 +02:00
.gitignore	feat(server): Rework model loading (#344 )	2023-06-08 14:51:52 +02:00
Cargo.lock	v0.9.3 (#634 )	2023-07-18 18:11:20 +02:00
Cargo.toml	v0.9.3 (#634 )	2023-07-18 18:11:20 +02:00
Dockerfile	feat(server): Add exllama GPTQ CUDA kernel support #553 (#666 )	2023-07-21 10:59:00 +02:00
LICENSE	Create LICENSE (#2 )	2022-10-22 10:44:52 +02:00
Makefile	feat(server): Add exllama GPTQ CUDA kernel support #553 (#666 )	2023-07-21 10:59:00 +02:00
README.md	docs: Update README.md (#643 )	2023-07-19 13:39:12 +02:00
rust-toolchain.toml	v0.9.0 (#525 )	2023-07-01 19:25:41 +02:00
sagemaker-entrypoint.sh	feat(sagemaker): add trust remote code to entrypoint (#394 )	2023-06-02 09:51:06 +02:00

README.md

Text Generation Inference

A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs api-inference widgets.

Features
Optimized Architectures
Get Started
Run BLOOM
Develop
Testing

Features

Serve the most popular Large Language Models with a simple launcher
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total throughput
Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
Quantization with bitsandbytes and GPT-Q
Safetensors weight loading
Watermarking with A Watermark for Large Language Models
Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see transformers.LogitsProcessor)
Stop sequences
Log probabilities
Production ready (distributed tracing with Open Telemetry, Prometheus metrics)

Optimized architectures

Other architectures are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Get started

Docker

The easiest way of getting started is using the official Docker container:

model=bigscience/bloom-560m
num_shard=2
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9 --model-id $model --num-shard $num_shard

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all options to serve your models (in the code or in the cli:

text-generation-launcher --help

You can then query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'

or from Python:

pip install text-generation

from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=17).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=17):
    if not response.token.special:
        text += response.token.text
print(text)

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-generation-inference.

Using on private models or gated models

You can use HUGGING_FACE_HUB_TOKEN environment variable to set the token used by text-generation-inference to give access to protected ressources.

Distributed Tracing

text-generation-inference is instrumented with distributed tracing using OpenTelemetry. You can use this feature by setting the address to an OTLP collector with the --otlp-endpoint argument.

A note on Shared Memory (shm)

NCCL is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference make use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to share data between the different devices of a NCCL group, NCCL might fall back to using the host memory if peer-to-peer using NVLink or PCI is not possible.

To allow the container to use 1G of Shared Memory and support SHM sharing, we add --shm-size 1g on the above command.

If you are running text-generation-inference inside Kubernetes. You can also add Shared Memory to the container by creating a volume with:

- name: shm
  emptyDir:
   medium: Memory
   sizeLimit: 1Gi

and mounting it to /dev/shm.

Finally, you can also disable SHM sharing by using the NCCL_SHM_DISABLE=1 environment variable. However, note that this will impact performance.

Local install

You can also opt to install text-generation-inference locally.

First install Rust and create a Python virtual environment with at least Python 3.9, e.g. using conda:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

conda create -n text-generation-inference python=3.9 
conda activate text-generation-inference

You may also need to install Protoc.

On Linux:

PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

On MacOS, using Homebrew:

brew install protobuf

Then run:

BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
make run-bloom-560m

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

sudo apt-get install libssl-dev gcc -y

CUDA Kernels

The custom CUDA kernels are only tested on NVIDIA A100s. If you have any installation or runtime issues, you can remove the kernels by using the DISABLE_CUSTOM_KERNELS=True environment variable.

Be aware that the official Docker image has them enabled by default.

Run BLOOM

Download

It is advised to download the weights ahead of time with the following command:

make download-bloom

Run

make run-bloom # Requires 8xA100 80GB

Quantization

You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

make run-bloom-quantize # Requires 8xA100 40GB

Develop

make server-dev
make router-dev

Testing

# python
make python-server-tests
make python-client-tests
# or both server and client tests
make python-tests
# rust cargo tests
make rust-tests
# integration tests
make integration-tests