Update Readme.md / documentation (#15)
* add documentation updates
* update readme
* Update README.md
parent ff703cb867
commit 339ede9e90
README.md | 42
@@ -20,3 +20,45 @@ Our long-term goal is to grow the community around this repository, as a playgro
### 4bit quantization

4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
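As a quick sketch of what a launch could look like (the model id below is only an illustrative assumption, not a requirement):

```bash
# Load the weights in 4bit NF4 precision; swap in any supported model id.
text-generation-launcher --model-id meta-llama/Llama-2-7b-chat-hf --quantize bitsandbytes-nf4
```
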
### CTranslate2

Int8 CTranslate2 quantization is available by passing `--quantize ct2` as a command line argument to `text-generation-launcher`. It converts the PyTorch model given in `--model-id` on the fly, and saves the quantized model for the next start-up, for up to 10x faster loading times. If CUDA is not available, CTranslate2 defaults to running on the CPU.
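A minimal sketch of the same idea, reusing the model from the Docker example below:

```bash
# The first start-up converts the model to int8 CTranslate2 and caches it;
# later start-ups reuse the cached conversion.
text-generation-launcher --model-id TheBloke/Llama-2-13B-Chat-fp16 --quantize ct2
```
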
### Chat Completions in OpenAI Format

The `/chat/completions` and `/completions` endpoints are available, using the API schema commonly known from OpenAI.
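For illustration, a request against a server listening on port 8080 (as in the Docker section below) might look like this; the body follows the OpenAI chat schema, and whether the `model` field is honored or ignored by this server is an assumption here:

```bash
curl -s http://127.0.0.1:8080/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "tgi", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'
```
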
You may set the `TGICHAT_(USER|ASS|SYS)_(PRE|POST)` environment variables to wrap the chat messages.

<details>
<summary>Optimal Llama-2-Chat config</summary>

For Llama-2, you should wrap each chat message with different strings, depending on the role.
Supported roles are `assistant`, `user`, and `system`.

```bash
# Llama-2-Chat delimits turns with "[INST] ... [/INST]" and wraps the
# system prompt in "<<SYS>> ... <</SYS>>".
TGICHAT_USER_PRE=" [INST] "
TGICHAT_USER_POST=" [/INST] "
TGICHAT_ASS_PRE=""
TGICHAT_ASS_POST=""
TGICHAT_SYS_PRE=" [INST] <<SYS>> "
TGICHAT_SYS_POST=" <</SYS>> [/INST] "
```
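Presumably each message is rendered as `PRE + content + POST`, so with the configuration above a user message `Hello` would reach the model as ` [INST] Hello [/INST] `.
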
Note: to access a gated model, you may need to set the `HUGGING_FACE_HUB_TOKEN` environment variable to your access token.
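For example, with the Docker setup from the next section, the token can be passed into the container via `-e` (a sketch; `$HF_TOKEN` is a placeholder for your own token):

```bash
docker run --gpus all --shm-size 1g -p 8080:80 -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    -v $volume:/data $image --model-id $model --quantize ct2
```
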
</details>

## Get started with Docker

```bash
model=TheBloke/Llama-2-13B-Chat-fp16 # needs around 14 GB of VRAM
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=docker.io/michaelf34/tgi:03-10-2023 # docker image by @michaelfeil

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data $image --model-id $model --quantize ct2
```
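Once the container is up, a quick smoke test could go through the upstream text-generation-inference `/generate` route (a sketch; the request schema is the one documented for upstream TGI):

```bash
curl -s http://127.0.0.1:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```
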
To see all options of `text-generation-launcher`, you may use the `--help` flag:
```bash
docker run $image --help
```
@@ -99,7 +99,7 @@ async fn compat_generate(
}
}
-/// Plain Completion request. Enable stream of token by setting `stream == true`
+/// Plain Completion request. Enable streaming of tokens by setting `stream == true` (in Python, use `pip install openai>=0.28.1`)
#[utoipa::path(
post,
tag = "Text Generation Inference",
@@ -147,7 +147,7 @@ async fn completions_generate(
}
}
-/// Chat Completion request. Enable stream of token by setting `stream == true`
+/// Chat Completion request. Enable streaming of tokens by setting `stream == true` (in Python, use `pip install openai>=0.28.1`)
#[utoipa::path(
post,
tag = "Text Generation Inference",