diff --git a/README.md b/README.md
index 6825249..02a94d2 100644
--- a/README.md
+++ b/README.md
@@ -20,3 +20,45 @@ Our long-term goal is to grow the community around this repository, as a playgro
 ### 4bit quantization
 
 4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
+
+### CTranslate2
+
+Int8 CTranslate2 quantization is available by passing `--quantize ct2` as a command line argument to `text-generation-launcher`. It converts the PyTorch model given in `--model-id` on the fly and saves the quantized model for subsequent start-ups, for up to 10x faster loading times. If CUDA is not available, CTranslate2 falls back to running on the CPU.
+
+### Chat Completions in OpenAI Format
+
+`/chat/completions` and `/completions` endpoints are available, following the familiar OpenAI API schema.
+You may set the `TGICHAT_(USER|ASS|SYS)_(PRE|POST)` environment variables to wrap the chat messages.
+
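+For example, you can exercise the chat endpoint with `curl`. This is a minimal sketch: it assumes the server from the Docker section below is listening on `localhost:8080` and that the request body follows the usual OpenAI chat schema (the served model is fixed by `--model-id`, so no `model` field is passed here):
+
+```bash
+curl http://localhost:8080/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "messages": [
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": "What is int8 quantization?"}
+        ],
+        "stream": false
+    }'
+```
+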
+<details>
+<summary>Optimal Llama-2-Chat config</summary>
+
+ For Llama-2, you should wrap each chat message with a different string, depending on its role.
+ The supported roles are `assistant`, `user`, and `system`.
+
+ ```bash
+ TGICHAT_USER_PRE=" [INST] "
+ TGICHAT_USER_POST=" [/INST] "
+ TGICHAT_ASS_PRE=""
+ TGICHAT_ASS_POST=""
+ TGICHAT_SYS_PRE=" [INST] <<SYS>> "
+ TGICHAT_SYS_POST=" <</SYS>> [/INST] "
+ ```
+
+ Note: To access a gated model, you may need to set the `HUGGING_FACE_HUB_TOKEN` environment variable to your access token.
+
+</details>
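+
+To hand these variables to the server, set them in the launcher's environment; with Docker, `-e` flags work. A sketch, reusing `$model`, `$volume`, and `$image` from the Docker section below:
+
+```bash
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
+    -e TGICHAT_USER_PRE=" [INST] " \
+    -e TGICHAT_USER_POST=" [/INST] " \
+    -e TGICHAT_ASS_PRE="" -e TGICHAT_ASS_POST="" \
+    -e TGICHAT_SYS_PRE=" [INST] <<SYS>> " \
+    -e TGICHAT_SYS_POST=" <</SYS>> [/INST] " \
+    $image --model-id $model --quantize ct2
+```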
+
+## Get started with Docker
+
+```bash
+model=TheBloke/Llama-2-13B-Chat-fp16 # around 14GB of VRAM
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+image=docker.io/michaelf34/tgi:03-10-2023 # docker image by @michaelfeil
+
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data $image --model-id $model --quantize ct2
+```
+
+To see all options of `text-generation-launcher`, you may use the `--help` flag:
+```bash
+docker run $image --help
+```
diff --git a/router/src/server.rs b/router/src/server.rs
index 4b5bf20..1284529 100644
--- a/router/src/server.rs
+++ b/router/src/server.rs
@@ -99,7 +99,7 @@ async fn compat_generate(
     }
 }
 
-/// Plain Completion request. Enable stream of token by setting `stream == true`
+/// Plain Completion request. Enable token streaming by setting `stream == true` (in Python, use: pip install openai>=0.28.1)
 #[utoipa::path(
     post,
     tag = "Text Generation Inference",
@@ -147,7 +147,7 @@
     }
 }
 
-/// Chat Completion request. Enable stream of token by setting `stream == true`
+/// Chat Completion request. Enable token streaming by setting `stream == true` (in Python, use: pip install openai>=0.28.1)
 #[utoipa::path(
     post,
     tag = "Text Generation Inference",
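
The doc-comment changes above advertise token streaming on both endpoints and point to the OpenAI Python client (`pip install openai>=0.28.1`) as one consumer. A rough sketch of the same call with plain `curl`, assuming the server from the README example above is listening on `localhost:8080` and that `/completions` accepts an OpenAI-style `prompt` field:

```bash
# -N disables curl's output buffering so streamed tokens print as they arrive
curl -N http://localhost:8080/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Once upon a time", "stream": true}'
```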