Update Readme.md / documentation (#15)
* add documentation updates
* update readme
* Update README.md
parent ff703cb867
commit 339ede9e90
README.md | 42
@@ -20,3 +20,45 @@ Our long-term goal is to grow the community around this repository, as a playgro
### 4bit quantization

4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
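As a quick sketch of what a launch could look like (the model id below is only an illustrative assumption, not a requirement):

```bash
# Load the weights in 4bit NF4 precision; swap in any supported model id.
text-generation-launcher --model-id meta-llama/Llama-2-7b-chat-hf --quantize bitsandbytes-nf4
```
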
### CTranslate2

Int8 CTranslate2 quantization is available by passing `--quantize ct2` as a command line argument to `text-generation-launcher`. It converts the PyTorch model given in `--model-id` on the fly, and saves the quantized model for the next start-up, for up to 10x faster loading times. If CUDA is not available, CTranslate2 defaults to running on the CPU.
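A minimal sketch of the same idea, reusing the model from the Docker example below:

```bash
# The first start-up converts the model to int8 CTranslate2 and caches it;
# later start-ups reuse the cached conversion.
text-generation-launcher --model-id TheBloke/Llama-2-13B-Chat-fp16 --quantize ct2
```
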
### Chat Completions in OpenAI Format

The `/chat/completions` and `/completions` endpoints are available, using the API schema commonly known from OpenAI.
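For illustration, a request against a server listening on port 8080 (as in the Docker section below) might look like this; the body follows the OpenAI chat schema, and whether the `model` field is honored or ignored by this server is an assumption here:

```bash
curl -s http://127.0.0.1:8080/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'
```
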
You may set the `TGICHAT_(USER|ASS|SYS)_(PRE|POST)` environment variables to wrap the chat messages.

<details>
<summary>Optimal Llama-2-Chat config</summary>

For Llama-2, you should wrap each chat message with different strings, depending on the role.
Supported roles are `assistant`, `user`, and `system`.

```bash
# Llama-2-Chat delimits turns with "[INST] ... [/INST]" and wraps the
# system prompt in "<<SYS>> ... <</SYS>>".
TGICHAT_USER_PRE=" [INST] "
TGICHAT_USER_POST=" [/INST] "
TGICHAT_ASS_PRE=""
TGICHAT_ASS_POST=""
TGICHAT_SYS_PRE=" [INST] <<SYS>> "
TGICHAT_SYS_POST=" <</SYS>> [/INST] "
```
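Presumably each message is rendered as `PRE + content + POST`, so with the configuration above a user message `Hello` would reach the model as ` [INST] Hello [/INST] `.
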
Note: to access a gated model, you may need to set the `HUGGING_FACE_HUB_TOKEN` environment variable to your access token.
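For example, with the Docker setup from the next section, the token can be passed into the container via `-e` (a sketch; `$HF_TOKEN` is a placeholder for your own token):

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    -v $volume:/data $image --model-id $model --quantize ct2
```
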
</details>

## Get started with Docker

```bash
model=TheBloke/Llama-2-13B-Chat-fp16 # needs around 14 GB of VRAM
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=docker.io/michaelf34/tgi:03-10-2023 # docker image by @michaelfeil

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data $image --model-id $model --quantize ct2
```
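Once the container is up, a quick smoke test could go through the upstream text-generation-inference `/generate` route (a sketch; the request schema is the one documented for upstream TGI):

```bash
curl -s http://127.0.0.1:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```
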
To see all options of `text-generation-launcher`, you may use the `--help` flag:
```bash
docker run $image --help
```
@@ -99,7 +99,7 @@ async fn compat_generate(
}
}
-/// Plain Completion request. Enable stream of token by setting `stream == true`
+/// Plain Completion request. Enable streaming of tokens by setting `stream == true` (in Python, use `pip install openai>=0.28.1`)
#[utoipa::path(
post,
tag = "Text Generation Inference",
@@ -147,7 +147,7 @@ async fn completions_generate(
}
}
-/// Chat Completion request. Enable stream of token by setting `stream == true`
+/// Chat Completion request. Enable streaming of tokens by setting `stream == true` (in Python, use `pip install openai>=0.28.1`)
#[utoipa::path(
post,
tag = "Text Generation Inference",