Update Readme.md / documentation (#15)
* add documentation updates
* update readme
* Update README.md
parent ff703cb867
commit 339ede9e90

README.md | 42
@@ -20,3 +20,45 @@ Our long-term goal is to grow the community around this repository, as a playgro
### 4bit quantization
4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by passing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command-line argument to `text-generation-launcher`.
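As a minimal sketch (the model id is a placeholder; the flag and launcher are the ones named above):

```bash
# Hypothetical example: load a model with 4-bit NF4 quantization from bitsandbytes.
# Swap in --quantize bitsandbytes-fp4 for the FP4 data type; <model-id> is a placeholder.
text-generation-launcher --model-id <model-id> --quantize bitsandbytes-nf4
```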
### CTranslate2
Int8 CTranslate2 quantization is available by passing `--quantize ct2` as a command-line argument to `text-generation-launcher`. It converts the PyTorch model given in `--model-id` on the fly, and saves the quantized model for subsequent start-ups, for up to 10x faster loading times. If CUDA is not available, CTranslate2 defaults to running on the CPU.
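For example, a minimal sketch (reusing the model from the Docker example below; the first start performs the conversion, later starts load the cached weights):

```bash
# First start: converts the PyTorch weights to int8 CTranslate2 and caches the result.
# Subsequent starts reuse the cached conversion for faster loading.
text-generation-launcher --model-id TheBloke/Llama-2-13B-Chat-fp16 --quantize ct2
```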
### Chat Completions in OpenAI Format
`/chat/completions` and `/completions` endpoints are available, following the API schema popularized by OpenAI.
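As an illustration, a request to the local server might look like the sketch below; the port matches the Docker example further down, and the request fields are assumptions based on the common OpenAI chat schema, not verified against this server:

```bash
# Hedged sketch: POST an OpenAI-style chat request to the local endpoint.
# Port 8080 matches the Docker example below; the body fields are assumed
# from the OpenAI chat schema.
curl http://127.0.0.1:8080/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "stream": false}'
```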
You may set the `TGICHAT_(USER|ASS|SYS)_(PRE|POST)` environment variables to wrap the chat messages.
<details>
<summary>Optimal Llama-2-Chat config</summary>
For Llama-2, you should wrap each chat message in a different string, depending on the role.
Supported roles are `assistant`, `user`, and `system`.
```bash
# Wrap each message with the Llama-2 chat template tokens, per role.
TGICHAT_USER_PRE=" [INST] "
TGICHAT_USER_POST=" [/INST] "
TGICHAT_ASS_PRE=""
TGICHAT_ASS_POST=""
TGICHAT_SYS_PRE=" [INST] <<SYS>> "
TGICHAT_SYS_POST=" <</SYS>> [/INST] "
```
Note: To access a gated model, you may need to set the `HUGGING_FACE_HUB_TOKEN` environment variable to your access token.
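For example, a sketch reusing the shell variables from the Docker section below (the token value is a placeholder):

```bash
# Placeholder token; -e forwards the variable from the host into the container.
export HUGGING_FACE_HUB_TOKEN=<your-access-token>
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  -e HUGGING_FACE_HUB_TOKEN $image --model-id $model
```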
</details>
## Get started with Docker
```bash
model=TheBloke/Llama-2-13B-Chat-fp16 # requires around 14 GB of VRAM
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=docker.io/michaelf34/tgi:03-10-2023 # docker image by @michaelfeil

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data $image --model-id $model --quantize ct2
```
To see all options of `text-generation-launcher`, use the `--help` flag:
```bash
docker run $image --help
```
@@ -99,7 +99,7 @@ async fn compat_generate(
}
}
-/// Plain Completion request. Enable stream of token by setting `stream == true`
+/// Plain Completion request. Enable stream of token by setting `stream == true`, (in Python use: pip install openai>=0.28.1)
#[utoipa::path(
post,
tag = "Text Generation Inference",
@@ -147,7 +147,7 @@ async fn completions_generate(
}
}
-/// Chat Completion request. Enable stream of token by setting `stream == true`
+/// Chat Completion request. Enable stream of token by setting `stream == true`, (in Python use: pip install openai>=0.28.1)
#[utoipa::path(
post,
tag = "Text Generation Inference",