Michael Feil 972e9a7f7c update causal batch for ct2 and fix nf4 (#17 ) * update causal batch for ct2 and fix nf4 * bump the ctranslate2 version --------- Co-authored-by: Michael Feil <michael.feil@michaelfeil.eu>		2024-02-09 11:07:14 -08:00
.github	update PR template	2023-08-01 18:18:28 +02:00
assets	feat(benchmark): tui based benchmarking tool (#149 )	2023-03-30 15:26:27 +02:00
benchmark	Compilation fix: Correct method argument types in generation.rs and validation.rs (#10 )	2023-08-23 13:52:49 -07:00
clients/python	feat(server): only compute prefill logprobs when asked (#406 )	2023-06-02 17:12:30 +02:00
docs	Wrapping completions and chat/completions endpoint (#2 )	2023-09-27 08:58:07 -07:00
integration-tests	feat: add cuda memory fraction (#659 )	2023-07-24 11:43:58 +02:00
launcher	Adding ctranslate2 quantization and inference: moving the contribution (#1 )	2023-10-02 11:12:49 -07:00
load_tests	feat: add nightly load testing (#358 )	2023-05-23 17:42:19 +02:00
proto	feat(server): auto max_batch_total_tokens for flash att models (#630 )	2023-07-19 09:31:25 +02:00
router	Update Readme.md / documentation (#15 )	2023-10-03 23:01:06 -07:00
server	update causal batch for ct2 and fix nf4 (#17 )	2024-02-09 11:07:14 -08:00
.dockerignore	chore: add `flash-attention` to docker ignore (#287 )	2023-05-05 17:52:09 +02:00
.gitignore	feat(server): Rework model loading (#344 )	2023-06-08 14:51:52 +02:00
Cargo.lock	v0.9.4 (#713 )	2023-07-27 19:25:15 +02:00
Cargo.toml	v0.9.4 (#713 )	2023-07-27 19:25:15 +02:00
Dockerfile	Adding ctranslate2 quantization and inference: moving the contribution (#1 )	2023-10-02 11:12:49 -07:00
LICENSE	Claim copyright (#7 )	2023-08-02 17:23:54 -07:00
Makefile	docs(README): update readme	2023-07-25 19:45:25 +02:00
README-HuggingFace.md	Add a new README (#3 )	2023-08-01 12:22:07 -07:00
README.md	update causal batch for ct2 and fix nf4 (#17 )	2024-02-09 11:07:14 -08:00
rust-toolchain.toml	v0.9.0 (#525 )	2023-07-01 19:25:41 +02:00
sagemaker-entrypoint.sh	feat(sagemaker): add trust remote code to entrypoint (#394 )	2023-06-02 09:51:06 +02:00

README.md

Text Generation Inference

This is Preemo's fork of text-generation-inference, originally developed by Hugging Face. The original README is at README-HuggingFace.md. Since Hugging Face's text-generation-inference is no longer open-source, we have forked it and will continue to develop it here.

Our goal is to create an open-source text generation inference server that is modular, so that state-of-the-art models, functionalities, and optimizations are easy to add. Functionalities and optimizations should be composable, so that users can combine them into a custom inference server that fits their needs.

Our plan

We at Preemo are currently busy with the first release of our other product, so we expect to start open-source development on this repository in September 2023. To make external contributions easier, we will work on the following:

Adding a publicly visible CI/CD pipeline that runs tests and builds Docker images
Unifying the build tools
Modularizing the codebase, introducing a plugin system

Our long-term goal is to grow the community around this repository, as a playground for trying out new ideas and optimizations in LLM inference. We at Preemo will implement features that interest us, but we also welcome contributions from the community, as long as they are modularized and composable.

Extra features in comparison to Hugging Face `text-generation-inference` v0.9.4

4-bit quantization

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. To enable it, pass --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command-line argument to text-generation-launcher.

CTranslate2

Int8 CTranslate2 quantization is available by passing --quantize ct2 as a command-line argument to text-generation-launcher. The launcher converts the PyTorch model given in --model-id on the fly and saves the quantized model, so that the next start-up loads up to 10x faster. If CUDA is not available, CTranslate2 falls back to running on the CPU.
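
For scripted start-ups, the flags described so far can be assembled programmatically. The following is a minimal sketch, not part of this repository: the helper `launcher_command` and the tuple `README_QUANTIZE_VALUES` are hypothetical names, and the check only covers the `--quantize` values mentioned in this README (the launcher may accept others).

```python
import shlex

# The --quantize values named in this README; the launcher may accept others.
README_QUANTIZE_VALUES = ("bitsandbytes-nf4", "bitsandbytes-fp4", "ct2")


def launcher_command(model_id, quantize=None):
    """Return the argv list for text-generation-launcher (hypothetical helper)."""
    argv = ["text-generation-launcher", "--model-id", model_id]
    if quantize is not None:
        if quantize not in README_QUANTIZE_VALUES:
            raise ValueError(f"not a value from this README: {quantize!r}")
        argv += ["--quantize", quantize]
    return argv


ct2_command = launcher_command("TheBloke/Llama-2-13B-Chat-fp16", quantize="ct2")
print(shlex.join(ct2_command))
# text-generation-launcher --model-id TheBloke/Llama-2-13B-Chat-fp16 --quantize ct2
```

The returned list can be handed to `subprocess.run` unchanged; keeping it as a list avoids shell quoting issues with unusual model paths.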

Chat Completions in OpenAI Format

The /chat/completions and /completions endpoints are available and use the API schema commonly known from OpenAI. To wrap the chat messages, you can set the TGICHAT_(USER|ASS|SYS)_(PRE|POST) environment variables, i.e. one PRE and one POST variable for each of the USER, ASS (assistant), and SYS (system) roles.
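
As a rough illustration, a request to the /chat/completions endpoint could be built with Python's standard library as shown below. This is a sketch under assumptions: the payload follows the common OpenAI shape (a `messages` list of `role`/`content` objects plus a `model` name), the base URL `http://localhost:8080` matches the port mapping used in the Docker example of this README, and `build_chat_request`, `send`, and the `placeholder-model` value are hypothetical names. Which fields the server actually accepts should be checked against the docs folder.

```python
import json
import urllib.request

# Assumption: the server is reachable on the host port used in the Docker example.
BASE_URL = "http://localhost:8080"


def build_chat_request(messages, model="placeholder-model", base_url=BASE_URL):
    """Build (but do not send) a POST request for the /chat/completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


example_messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is CTranslate2?"},
]
example_request = build_chat_request(example_messages)
# With a running server: print(send(example_request))
```

Building and sending are kept separate so the payload can be inspected or logged without a running server.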

Optimal Llama-2-Chat config

For Llama-2, wrap each chat message with different strings depending on the role. Supported roles are `assistant`, `user`, and `system`.

TGICHAT_USER_PRE=" [INST] "
TGICHAT_USER_POST=" [/INST] "
TGICHAT_ASS_PRE=""
TGICHAT_ASS_POST=""
TGICHAT_SYS_PRE=" [INST] <<SYS>> "
TGICHAT_SYS_POST=" <</SYS>> [/INST] "
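
To make the effect of these variables concrete, the sketch below mimics the wrapping in plain Python: each message's content is surrounded by the PRE and POST string for its role, and the results are concatenated. It is an illustration under assumptions, not the server's implementation: `wrap_messages`, `ROLE_TO_KEY`, the plain concatenation order, and the treatment of unset variables as empty strings are all hypothetical. The demo uses neutral placeholder markers rather than the Llama-2 values.

```python
import os

# Maps the supported chat roles to the infix used in the variable names.
ROLE_TO_KEY = {"user": "USER", "assistant": "ASS", "system": "SYS"}


def wrap_messages(messages, env=None):
    """Concatenate messages, each wrapped in its role's PRE/POST strings.

    Unset variables are treated as empty strings (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    parts = []
    for message in messages:
        key = ROLE_TO_KEY[message["role"]]  # raises KeyError for unsupported roles
        pre = env.get(f"TGICHAT_{key}_PRE", "")
        post = env.get(f"TGICHAT_{key}_POST", "")
        parts.append(f"{pre}{message['content']}{post}")
    return "".join(parts)


# Illustrative markers only; substitute the values your model expects.
demo_env = {
    "TGICHAT_SYS_PRE": "<sys>",
    "TGICHAT_SYS_POST": "</sys>",
    "TGICHAT_USER_PRE": "<user>",
    "TGICHAT_USER_POST": "</user>",
}
demo_prompt = wrap_messages(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello."},
    ],
    env=demo_env,
)
print(demo_prompt)  # <sys>Be brief.</sys><user>Hi!</user>Hello.
```

Passing the settings as a mapping keeps the sketch easy to try with different marker sets without touching the process environment.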

Note: To access a gated model, you may need to set the HUGGING_FACE_HUB_TOKEN environment variable to your access token.

Get started with Docker

model=TheBloke/Llama-2-13B-Chat-fp16 # around 14 GB of VRAM
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=docker.io/michaelf34/tgi:05-11-2023 # docker image by @michaelfeil

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data $image --model-id $model --quantize ct2

To see all options of text-generation-launcher, use the --help flag:

docker run $image --help