From 94d243b3d7916879f1735d4a6f231e915765c1c4 Mon Sep 17 00:00:00 2001
From: Nicolas Patry
Date: Thu, 1 Feb 2024 10:23:37 +0100
Subject: [PATCH] Freshen up the README.

---
 README.md | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index 5fdb9f14..c4d84efa 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@ to power Hugging Chat, the Inference API and Inference Endpoint.
- [Local Install](#local-install)
- [CUDA Kernels](#cuda-kernels)
- [Optimized architectures](#optimized-architectures)
-- [Run Falcon](#run-falcon)
+- [Run locally](#run-locally)
- [Run](#run)
- [Quantization](#quantization)
- [Develop](#develop)
@@ -42,7 +42,11 @@ Text Generation Inference (TGI) is a toolkit for deploying and serving Large Lan
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
-- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
+- Quantization with:
+  - [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+  - [GPT-Q](https://arxiv.org/abs/2210.17323)
+  - [EETQ](https://github.com/NetEase-FuXi/EETQ)
+  - [AWQ](https://github.com/casper-hansen/AutoAWQ)
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
@@ -51,6 +55,14 @@ Text Generation Inference (TGI) is a toolkit for deploying and serving Large Lan
- Custom Prompt Generation: Easily generate text by providing custom prompts to guide the model's output
- Fine-tuning Support: Utilize fine-tuned models for specific tasks to achieve higher accuracy and performance

+### Hardware support
+
+- [Nvidia](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference)
+- [AMD](https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference) (`-rocm` images)
+- [Inferentia](https://github.com/huggingface/optimum-neuron/tree/main/text-generation-inference)
+- [Intel GPU](https://github.com/huggingface/text-generation-inference/pull/1475)
+- [Gaudi](https://github.com/huggingface/tgi-gaudi)
+
## Get Started

@@ -154,7 +166,7 @@ Python 3.9, e.g. using `conda`:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

-conda create -n text-generation-inference python=3.9
+conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
```

@@ -180,7 +192,7 @@ Then run:

```shell
BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
-make run-falcon-7b-instruct
+text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

**Note:** on some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:

```shell
sudo apt-get install libssl-dev gcc -y
```

-### CUDA Kernels
-
-The custom CUDA kernels are only tested on NVIDIA A100, AMD MI210 and AMD MI250. If you have any installation or runtime issues, you can remove
-the kernels by using the `DISABLE_CUSTOM_KERNELS=True` environment variable.
-
-Be aware that the official Docker image has them enabled by default.
-
## Optimized architectures

TGI works out of the box to serve optimized models for all modern models. They can be found in [this list](https://huggingface.co/docs/text-generation-inference/supported_models).

@@ -210,12 +215,12 @@

or

-## Run Falcon
+## Run locally

### Run

```shell
-make run-falcon-7b-instruct
+text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
```

### Quantization

@@ -223,7 +228,7 @@ make run-falcon-7b-instruct
You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:

```shell
-make run-falcon-7b-instruct-quantize
+text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes
```

4bit quantization is available using the [NF4 and FP4 data types from bitsandbytes](https://arxiv.org/pdf/2305.14314.pdf). It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.
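
The quantization backends listed in the features section above map to values of the launcher's `--quantize` flag. A minimal sketch of how this is typically invoked, reusing the model id from the patch; `text-generation-launcher --help` prints the full set of accepted values:

```shell
# List all launcher options, including the accepted --quantize values.
text-generation-launcher --help

# 8-bit bitsandbytes quantization, as in the Quantization section above.
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes

# 4-bit NF4 variant mentioned in the closing paragraph above.
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4
```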
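
Once the launcher is running with any of the commands above, generation can be exercised over HTTP. A minimal sketch, assuming the server was started with an explicit `--port 8080`; the prompt and parameters are arbitrary examples:

```shell
# Start the server on an explicit port.
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --port 8080

# From another terminal: request a completion from the /generate route.
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```

The `/generate_stream` route accepts the same payload and streams tokens back as Server-Sent Events, matching the token-streaming feature listed above.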