# Text Generation Inference
This is Preemo's fork of `text-generation-inference`, originally developed by Hugging Face. The original README is at [README-HuggingFace.md](README-HuggingFace.md). Since Hugging Face's `text-generation-inference` is no longer open-source, we have forked it and will continue to develop it here.
Our goal is to create an open-source text generation inference server that is modularized to allow easy addition of state-of-the-art models, functionalities, and optimizations. Functionalities and optimizations should be composable, so that users can easily combine them to create a custom inference server that fits their needs.
## Our plan
We at Preemo are currently busy working on the first release of our other product, so we expect to start open-source development on this repository in September 2023. To ease external contributions, we will be working on the following:
- [ ] Adding a publicly visible CI/CD pipeline that runs tests and builds Docker images
- [ ] Unifying the build tools
- [ ] Modularizing the codebase, introducing a plugin system
Our long-term goal is to grow the community around this repository as a playground for trying out new ideas and optimizations in LLM inference. We at Preemo will implement features that interest us, but we also welcome contributions from the community, as long as they are modularized and composable.
## Extra features in comparison to Hugging Face `text-generation-inference` v0.9.4
### 4-bit quantization
4-bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by passing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command-line argument to `text-generation-launcher`.
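As a minimal sketch, a launch with NF4 quantization could look like the following (the model ID here is purely illustrative; any supported model works):

```bash
# Example: serve a model with 4-bit NF4 quantization enabled.
# The model ID is an illustrative placeholder.
text-generation-launcher \
  --model-id meta-llama/Llama-2-13b-chat-hf \
  --quantize bitsandbytes-nf4
```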
### CTranslate2
Int8 CTranslate2 quantization is available by passing `--quantize ct2` as a command-line argument to `text-generation-launcher`. It converts the PyTorch model specified by `--model-id` on the fly, and saves the quantized model for subsequent start-ups, giving up to 10x faster loading times. If CUDA is not available, CTranslate2 defaults to running on the CPU.
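For example, a launch with CTranslate2 quantization might look like this; the first start-up performs the conversion, and later start-ups reuse the saved quantized model:

```bash
# Example: int8 CTranslate2 quantization. The PyTorch weights are converted
# on the first run; subsequent runs load the cached quantized model.
text-generation-launcher \
  --model-id TheBloke/Llama-2-13B-Chat-fp16 \
  --quantize ct2
```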
### Chat Completions in OpenAI Format
The `/chat/completions` and `/completions` endpoints are available, following the API schema commonly known from OpenAI.
You may set the `TGICHAT_(USER|ASS|SYS)_(PRE|POST)` environment variables to wrap the chat messages.
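As a rough sketch, a chat request against a server listening on host port 8080 (as in the Docker example below) could look like this; the exact set of accepted parameters may differ from OpenAI's:

```bash
# Hypothetical OpenAI-style chat request; field names follow the OpenAI schema.
curl http://localhost:8080/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is text generation inference?"}
        ],
        "max_tokens": 128
      }'
```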
<details>
<summary>Optimal Llama-2-Chat config</summary>
For Llama-2, you should wrap each chat message in different strings, depending on the role.
The supported roles are `assistant`, `user`, and `system`.
```bash
TGICHAT_USER_PRE=" [INST] "
TGICHAT_USER_POST=" [/INST] "
TGICHAT_ASS_PRE=""
TGICHAT_ASS_POST=""
TGICHAT_SYS_PRE=" [INST] <<SYS>> "
TGICHAT_SYS_POST=" <</SYS>> [/INST] "
```
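With this configuration, a user message such as `Hello` should reach the model wrapped as ` [INST] Hello [/INST] `, matching the Llama-2 chat prompt format.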
Note: To access a gated model, you may need to set the `HUGGING_FACE_HUB_TOKEN` environment variable to your access token.
</details>
## Get started with Docker
```bash
model=TheBloke/Llama-2-13B-Chat-fp16 # requires around 14 GB of VRAM
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
image=docker.io/michaelf34/tgi:05-11-2023 # Docker image by @michaelfeil
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data $image --model-id $model --quantize ct2
```
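Once the container is up, a quick way to sanity-check it is a request like the following, assuming the `/generate` endpoint behaves as in upstream `text-generation-inference`:

```bash
# Smoke test against the container started above (mapped to host port 8080).
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```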
To see all options of `text-generation-launcher`, you may use the `--help` flag:
```bash
docker run $image --help
```