Update documentation for Supported models (#2386)

* Minor doc fixes

* up.

* Other minor updates.
Vaibhav Srivastav 2024-08-09 15:01:34 +02:00 committed by GitHub
parent 977534bcb8
commit b2b9c42724
6 changed files with 41 additions and 15 deletions

@@ -42,6 +42,7 @@ Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs)
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) compatible with the OpenAI Chat Completions API
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
- Quantization with:
  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
@@ -49,7 +50,7 @@ Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs)
  - [EETQ](https://github.com/NetEase-FuXi/EETQ)
  - [AWQ](https://github.com/casper-hansen/AutoAWQ)
  - [Marlin](https://github.com/IST-DASLab/marlin)
  - [fp8](https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
@@ -94,6 +95,29 @@ curl 127.0.0.1:8080/generate_stream \
    -H 'Content-Type: application/json'
```

You can also use [TGI's Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) to obtain responses compatible with the OpenAI Chat Completions API.

```bash
curl localhost:3000/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```
**Note:** To use NVIDIA GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. To run the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the `--gpus all` flag and add `--disable-custom-kernels`; please note that CPU is not the intended platform for this project, so performance might be subpar.
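For concreteness, a CPU-only invocation might look like the sketch below; the example model and the `2.2.0` image tag (inferred from the ROCm tag in the note that follows) are assumptions, not values taken from this diff.

```shell
# Sketch of a CPU-only run: --gpus all is dropped and custom CUDA kernels are disabled.
# Model and image tag are illustrative assumptions.
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id $model --disable-custom-kernels
```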
**Note:** TGI supports AMD Instinct MI210 and MI250 GPUs. Details can be found in the [Supported Hardware documentation](https://huggingface.co/docs/text-generation-inference/supported_models#supported-hardware). To use AMD GPUs, please use `docker run --device /dev/kfd --device /dev/dri --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.2.0-rocm --model-id $model` instead of the command above.
@@ -122,7 +146,7 @@ For example, if you want to serve the gated Llama V2 model variants:
or with Docker:

```shell
model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>
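# The docker run line itself falls outside the hunk shown above; this is only a sketch of how
# these variables are typically wired together. The HF_TOKEN variable name and the 2.2.0 image
# tag are assumptions, not taken from this diff -- check the README for the exact invocation.
docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id $model
```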
@@ -234,7 +258,7 @@ text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
### Quantization

You can also run pre-quantized weights (AWQ, GPTQ, Marlin) or quantize weights on the fly with bitsandbytes, EETQ, or fp8 to reduce the VRAM requirement:

```shell
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize
```

@@ -242,6 +266,8 @@ text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize
4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.

Read more about quantization in the [Quantization documentation](https://huggingface.co/docs/text-generation-inference/en/conceptual/quantization).
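As a concrete sketch of the flags above, an on-the-fly 4-bit NF4 launch looks like this; the model id simply reuses the one from the snippet above:

```shell
# Quantize to 4-bit NF4 at load time; no calibration dataset is required.
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4

# For pre-quantized checkpoints (AWQ, GPTQ, Marlin), point --model-id at the quantized repo
# and pass the matching --quantize value instead (e.g. --quantize awq).
```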
## Develop

```shell

@@ -11,7 +11,7 @@ We recommend using the official quantization scripts for creating your quants:
For on-the-fly quantization, you simply need to pass one of the supported quantization types and TGI takes care of the rest.

## Quantization with bitsandbytes, EETQ & fp8

bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn't require a calibration dataset or any post-processing: weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.

@@ -32,7 +32,7 @@ docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingf
You can get more information about 8-bit quantization by reading this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), and 4-bit quantization by reading [this blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

Similarly, you can pass `--quantize eetq` or `--quantize fp8` for the respective quantization schemes.
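For example, a Docker launch with on-the-fly EETQ quantization could look like the sketch below; the model and image tag are illustrative assumptions rather than values taken from this page:

```shell
# Sketch: quantize with EETQ at load time; swap in --quantize fp8 for the fp8 scheme.
# Model and image tag are illustrative assumptions.
model=mistralai/Mistral-7B-Instruct-v0.2
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id $model --quantize eetq
```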
In addition to this, TGI allows creating GPTQ quants directly by passing the model weights and a calibration dataset.

@@ -21,7 +21,7 @@ TGI supports various hardware. Make sure to check the [Using TGI with Nvidia GPU
## Consuming TGI

Once TGI is running, you can use the `generate` endpoint or the OpenAI Chat Completions API-compatible [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) by making requests. To learn more about how to query the endpoints, check the [Consuming TGI](./basic_tutorials/consuming_tgi) section, where we show examples with utility libraries and UIs. Below you can see a simple snippet to query the endpoint.

<inferencesnippet>
<python>

@@ -1,22 +1,22 @@
# Supported Models and Hardware

Text Generation Inference enables serving optimized models on specific hardware for the highest performance. The following sections list which models (VLMs & LLMs) are supported.

## Supported Models

- [Deepseek V2](https://huggingface.co/deepseek-ai/DeepSeek-V2)
- [Idefics 2](https://huggingface.co/HuggingFaceM4/idefics2-8b) (Multimodal)
- [Llava Next (1.6)](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) (Multimodal)
- [Llama](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)
- [Phi 3](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)
- [Gemma](https://huggingface.co/google/gemma-7b)
- [PaliGemma](https://huggingface.co/google/paligemma-3b-pt-224)
- [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)
- [Cohere](https://huggingface.co/CohereForAI/c4ai-command-r-plus)
- [Dbrx](https://huggingface.co/databricks/dbrx-instruct)
- [Mamba](https://huggingface.co/state-spaces/mamba-2.8b-slimpj)
- [Mistral](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
- [Mixtral](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)
- [Gpt Bigcode](https://huggingface.co/bigcode/gpt_bigcode-santacoder)
- [Phi](https://huggingface.co/microsoft/phi-1_5)

@@ -180,7 +180,7 @@ class ModelType(enum.Enum):
    LLAMA = {
        "type": "llama",
        "name": "Llama",
        "url": "https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f",
    }
    PHI3 = {
        "type": "phi3",
@@ -200,7 +200,7 @@ class ModelType(enum.Enum):
    GEMMA2 = {
        "type": "gemma2",
        "name": "Gemma2",
        "url": "https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315",
    }
    COHERE = {
        "type": "cohere",
@@ -220,7 +220,7 @@ class ModelType(enum.Enum):
    MISTRAL = {
        "type": "mistral",
        "name": "Mistral",
        "url": "https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407",
    }
    MIXTRAL = {
        "type": "mixtral",

@@ -7,7 +7,7 @@ import os
TEMPLATE = """
# Supported Models and Hardware

Text Generation Inference enables serving optimized models on specific hardware for the highest performance. The following sections list which models (VLMs & LLMs) are supported.

## Supported Models