Update Readme.md / documentation (#15)

* add documentation updates

* update readme

* Update README.md
Author: Michael Feil, 2023-10-04 08:01:06 +02:00, committed by GitHub
parent ff703cb867
commit 339ede9e90
2 changed files with 44 additions and 2 deletions

@@ -20,3 +20,45 @@ Our long-term goal is to grow the community around this repository, as a playgro
### 4bit quantization
4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
### CTranslate2
Int8 CTranslate2 quantization is available by passing `--quantize ct2` as a command line argument to `text-generation-launcher`. It converts the PyTorch model provided in `--model-id` on the fly, and saves the quantized model for the next start-up, giving up to 10x faster loading times. If CUDA is not available, CTranslate2 defaults to running on the CPU.
### Chat Completions in OpenAI Format
`/chat/completions` and `/completions` endpoints are available, using the API schema known from OpenAI.
You may set the `TGICHAT_(USER|ASS|SYS)_(PRE|POST)` environment variables to wrap the chat messages.
<details>
<summary>Optimal Llama-2-Chat config</summary>
For Llama-2, you should wrap each chat message with a different string, depending on the role.
Supported roles are `assistant`, `user`, and `system`.
```bash
TGICHAT_USER_PRE=" [INST] "
TGICHAT_USER_POST=" [/INST] "
TGICHAT_ASS_PRE=""
TGICHAT_ASS_POST=""
TGICHAT_SYS_PRE=" [INST] <<SYS>> "
TGICHAT_SYS_POST=" <</SYS>> [/INST] "
```
Note: To access a gated model, you may need to set the `HUGGING_FACE_HUB_TOKEN` environment variable to your access token.
</details>
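To make the wrapping concrete, here is a minimal Python sketch of how the `TGICHAT_*` variables could assemble chat messages into a single Llama-2 prompt. The variable names come from the README above; the `wrap_messages` helper is hypothetical and is not part of this repository:

```python
import os

# Maps OpenAI-style roles to the middle segment of the TGICHAT_* variable names.
ROLE_KEYS = {"user": "USER", "assistant": "ASS", "system": "SYS"}

def wrap_messages(messages):
    """Hypothetical sketch: wrap each message with its role's PRE/POST strings."""
    parts = []
    for msg in messages:
        key = ROLE_KEYS[msg["role"]]
        pre = os.environ.get(f"TGICHAT_{key}_PRE", "")
        post = os.environ.get(f"TGICHAT_{key}_POST", "")
        parts.append(pre + msg["content"] + post)
    return "".join(parts)

# Using the Llama-2 values from the config above:
os.environ["TGICHAT_USER_PRE"] = " [INST] "
os.environ["TGICHAT_USER_POST"] = " [/INST] "
prompt = wrap_messages([{"role": "user", "content": "Hello!"}])
print(prompt)
```

This is only an illustration of the wrapping behavior; the server performs the equivalent step internally.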
## Get started with Docker
```bash
model=TheBloke/Llama-2-13B-Chat-fp16 # requires around 14GB of VRAM
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=docker.io/michaelf34/tgi:03-10-2023 # docker image by @michaelfeil
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data $image --model-id $model --quantize ct2
```
To see all options of `text-generation-launcher`, use the `--help` flag:
```bash
docker run $image --help
```
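Once the container is running, the OpenAI-style endpoint can be exercised with a plain HTTP POST. The sketch below builds a `/chat/completions` request using only the Python standard library; the field names follow the OpenAI chat schema, and the host/port match the Docker command above (assumptions, not repository code):

```python
import json
from urllib import request

# Sketch of a /chat/completions request body in the OpenAI chat schema.
payload = {
    "model": "TheBloke/Llama-2-13B-Chat-fp16",
    "messages": [{"role": "user", "content": "What is quantization?"}],
    "stream": False,
}

# Passing `data` makes urllib issue a POST request.
req = request.Request(
    "http://localhost:8080/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = request.urlopen(req)  # uncomment with a running server
```

Alternatively, the README's Rust doc comments note that the endpoints work with the `openai` Python package (`pip install openai>=0.28.1`) by pointing its API base at the server.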

@@ -99,7 +99,7 @@ async fn compat_generate(
}
}
-/// Plain Completion request. Enable stream of token by setting `stream == true`
+/// Plain Completion request. Enable stream of token by setting `stream == true`, (in Python use: pip install openai>=0.28.1)
#[utoipa::path(
post,
tag = "Text Generation Inference",
@@ -147,7 +147,7 @@ async fn completions_generate(
}
}
-/// Chat Completion request. Enable stream of token by setting `stream == true`
+/// Chat Completion request. Enable stream of token by setting `stream == true`, (in Python use: pip install openai>=0.28.1)
#[utoipa::path(
post,
tag = "Text Generation Inference",