Setup for doc-builder and docs for TGI (#740)

I added ToC for docs v1 & started setting up for doc-builder. cc @Narsil @osanseviero --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: osanseviero <osanseviero@gmail.com> Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>
2023-08-10 11:24:52 +03:00 · 2023-08-10 11:24:52 +03:00 · 647ae7a7d3
parent 0e8b47811e
commit 647ae7a7d3
12 changed files with 338 additions and 0 deletions
--- a/.github/workflows/build_documentation.yml
+++ b/.github/workflows/build_documentation.yml
@ -0,0 +1,19 @@
+name: Build documentation
+
+on:
+  push:
+    branches:
+      - main
+      - doc-builder*
+      - v*-release
+
+jobs:
+   build:
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    with:
+      commit_sha: ${{ github.sha }}
+      package: text-generation-inference
+      additional_args: --not_python_module
+    secrets:
+      token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
--- a/.github/workflows/build_pr_documentation.yml
+++ b/.github/workflows/build_pr_documentation.yml
@ -0,0 +1,17 @@
+name: Build PR Documentation
+
+on:
+  pull_request:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
+    with:
+      commit_sha: ${{ github.event.pull_request.head.sha }}
+      pr_number: ${{ github.event.number }}
+      package: text-generation-inference
+      additional_args: --not_python_module 
--- a/.github/workflows/delete_doc_comment.yml
+++ b/.github/workflows/delete_doc_comment.yml
@ -0,0 +1,12 @@
+name: Delete doc comment
+
+on:
+  pull_request:
+    types: [ closed ]
+
+
+jobs:
+  delete:
+    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
+    with:
+      pr_number: ${{ github.event.number }}
--- a/.github/workflows/upload_pr_documentation.yml
+++ b/.github/workflows/upload_pr_documentation.yml
@ -0,0 +1,16 @@
+name: Upload PR Documentation
+
+on:
+  workflow_run:
+    workflows: ["Build PR Documentation"]
+    types:
+      - completed
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
+    with:
+      package_name: text-generation-inference
+    secrets:
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
+      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@ -0,0 +1,18 @@
+- sections:
+  - local: index
+    title: Text Generation Inference
+  - local: quicktour
+    title: Quick Tour
+  - local: installation
+    title: Installation
+  - local: supported_models
+    title: Supported Models and Hardware
+  title: Getting started
+- sections:
+  - local: basic_tutorials/consuming_tgi
+    title: Consuming TGI
+  - local: basic_tutorials/preparing_model
+    title: Preparing Model for Serving
+  - local: basic_tutorials/gated_model_access
+    title: Serving Private & Gated Models
+  title: Tutorials
--- a/docs/source/basic_tutorials/consuming_tgi.md
+++ b/docs/source/basic_tutorials/consuming_tgi.md
@ -0,0 +1,57 @@
+# Consuming Text Generation Inference
+
+There are many ways you can consume Text Generation Inference server in your applications. After launching, you can use the `/generate` route and make a `POST` request to get results from the server. You can also use the `/generate_stream` route if you want TGI to return a stram of tokens. You can make the requests using the tool of your preference, such as curl, Python or TypeScrpt. For a final end-to-end experience, we also open-sourced ChatUI, a chat interface for open-source models.
+
+## curl
+
+After the launch, you can query the model using either the `/generate` or `/generate_stream` routes:
+
+```shell
+curl 127.0.0.1:8080/generate \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
+    -H 'Content-Type: application/json'
+```
+
+
+## Inference Client
+
+[`huggingface-hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a nice high-level class, [`~huggingface_hub.InferenceClient`], which makes it easy to make calls to a TGI endpoint. `InferenceClient` also takes care of parameter validation and provides a simple to-use interface.
+
+You can simply install `huggingface-hub` package with pip.
+
+```python
+pip install huggingface-hub
+```
+
+Once you start the TGI server, instantiate `InferenceClient()` with the URL to the endpoint serving the model. You can then call `text_generation()` to hit the endpoint through Python. 
+
+```python
+
+from huggingface_hub import InferenceClient
+
+client = InferenceClient(model=URL_TO_ENDPOINT_SERVING_TGI)
+client.text_generation(prompt="Write a code for snake game", model=URL_TO_ENDPOINT_SERVING_TGI)
+```
+
+You can check out the details of the function [here](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation).
+
+
+## ChatUI
+
+ChatUI is an open-source interface built for LLM serving. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at [Hugging Chat](https://huggingface.co/chat/), or use the [ChatUI Docker Space](https://huggingface.co/new-space?template=huggingchat/chat-ui-template) to deploy your own Hugging Chat to Spaces.
+
+To serve both ChatUI and TGI in same environment, simply add your own endpoints to the `MODELS` variable in `.env.local` file inside the `chat-ui` repository. Provide the endpoints pointing to where TGI is served.
+
+```
+{
+// rest of the model config here
+"endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
+}
+```
+
+TODO: Add screenshot
+
+## API documentation
+
+You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route. The Swagger UI is also available [here](https://huggingface.github.io/text-generation-inference). 
--- a/docs/source/basic_tutorials/gated_model_access.md
+++ b/docs/source/basic_tutorials/gated_model_access.md
@ -0,0 +1,5 @@
+# Serving Private & Gated Models
+
+If the model you wish to serve is behind gated access or the model repository on Hugging Face Hub is private, and you have access to the model, you can provide your Hugging Face Hub access token. You can generate and copy a read token from [Hugging Face Hub tokens page](https://huggingface.co/settings/tokens)
+
+If you're using the CLI, set the `HUGGING_FACE_HUB_TOKEN` environment variable.
--- a/docs/source/basic_tutorials/preparing_model.md
+++ b/docs/source/basic_tutorials/preparing_model.md
@ -0,0 +1,22 @@
+# Preparing the Model
+
+Text Generation Inference improves the model in several aspects. 
+
+## Quantization
+
+TGI supports [bits-and-bytes](https://github.com/TimDettmers/bitsandbytes#bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323) quantization. To speed up inference with quantization, simply set `quantize` flag to `bitsandbytes` or `gptq` depending on the quantization technique you wish to use. When using GPT-Q quantization, you need to point to one of the models [here](https://huggingface.co/models?search=gptq).
+
+
+## RoPE Scaling
+
+RoPE scaling can be used to increase the sequence length of the model during the inference time without necessarily fine-tuning it. To enable RoPE scaling, simply pass `--rope-scaling`, `--max-input-length` and `--rope-factors` flags when running through CLI. `--rope-scaling` can take the values `linear` or `dynamic`. If your model is not fine-tuned to a longer sequence length, use `dynamic`. `--rope-factor` is the ratio between the intended max sequence length and the model's original max sequence length. Make sure to pass `--max-input-length` to provide maximum input length for extension. 
+
+<Tip>
+
+We recommend using `dynamic` RoPE scaling.
+
+</Tip>
+
+## Safetensors
+
+[Safetensors](https://github.com/huggingface/safetensors) is a fast and safe persistence format for deep learning models, and is required for tensor parallelism. TGI supports `safetensors` model loading under the hood. By default, given a repository with `safetensors` and `pytorch` weights, TGI will always load `safetensors`. If there's no `pytorch` weights, TGI will convert the weights to `safetensors` format. 
--- a/docs/source/index.md
+++ b/docs/source/index.md
@ -0,0 +1,27 @@
+# Text Generation Inference
+
+Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and T5.
+
+![Text Generation Inference](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)
+
+Text Generation Inference implements many optimizations and features, such as:
+
+- Simple launcher to serve most popular LLMs
+- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
+- Tensor Parallelism for faster inference on multiple GPUs
+- Token streaming using Server-Sent Events (SSE)
+- Continuous batching of incoming requests for increased total throughput
+- Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention) and [Paged Attention](https://github.com/vllm-project/vllm) on the most popular architectures
+- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) and [GPT-Q](https://arxiv.org/abs/2210.17323)
+- [Safetensors](https://github.com/huggingface/safetensors) weight loading
+- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
+- Logits warper (temperature scaling, top-p, top-k, repetition penalty)
+- Stop sequences
+- Log probabilities
+
+
+Text Generation Inference is used in production by multiple projects, such as:
+
+- [Hugging Chat](https://github.com/huggingface/chat-ui), an open-source interface for open-access models, such as Open Assistant and Llama
+- [OpenAssistant](https://open-assistant.io/), an open-source community effort to train LLMs in the open
+- [nat.dev](http://nat.dev/), a playground to explore and compare LLMs.
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@ -0,0 +1,71 @@
+# Installation
+
+This section explains how to install the CLI tool as well as installing TGI from source. **The strongly recommended approach is to use Docker, as it does not require much setup. Check [the Quick Tour](./quicktour) to learn how to run TGI with Docker.**
+
+## Install CLI
+
+TODO
+
+
+## Local Installation from Source
+
+Before you start, you will need to setup your environment, and install Text Generation Inference. Text Generation Inference is tested on **Python 3.9+**.
+
+Text Generation Inference is available on pypi, conda and GitHub. 
+
+To install and launch locally, first [install Rust](https://rustup.rs/) and create a Python virtual environment with at least
+Python 3.9, e.g. using conda:
+
+```shell
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+
+conda create -n text-generation-inference python=3.9
+conda activate text-generation-inference
+```
+
+You may also need to install Protoc.
+
+On Linux:
+
+```shell
+PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
+curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
+sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
+sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
+rm -f $PROTOC_ZIP
+```
+
+On MacOS, using Homebrew:
+
+```shell
+brew install protobuf
+```
+
+Then run to install Text Generation Inference:
+
+```shell
+BUILD_EXTENSIONS=True make install # Install repository and HF/transformer fork with CUDA kernels
+```
+
+<Tip warning={true}>
+
+On some machines, you may also need the OpenSSL libraries and gcc. On Linux machines, run:
+
+```shell
+sudo apt-get install libssl-dev gcc -y
+```
+
+</Tip>
+
+Once installation is done, simply run:
+
+```shell
+make run-falcon-7b-instruct
+```
+
+This will serve Falcon 7B Instruct model from the port 8080, which we can query.
+
+To see all options to serve your models, check in the [codebase](https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs) or the CLI:
+```
+text-generation-launcher --help
+```
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@ -0,0 +1,34 @@
+# Quick Tour
+
+The easiest way of getting started is using the official Docker container. Install Docker following [their installation instructions](https://docs.docker.com/get-docker/).
+
+Let's say you want to deploy [Falcon-7B Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) model with TGI. Here is an example on how to do that:
+
+```shell
+model=tiiuae/falcon-7b-instruct
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.0 --model-id $model
+```
+
+<Tip warning={true}>
+
+To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)  . We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.
+
+</Tip>
+
+Once TGI is running, you can use the `generate` endpoint by doing requests. To learn more about how to query the endpoints, check the [Consuming TGI](./basic_tutorials/consuming_tgi) section.
+
+```shell
+curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
+```
+
+<Tip>
+
+To see all possible flags and options, you can use the `--help` flag. It's possible to configure the number of shards, quantization, generation parameters, and more.
+
+```shell
+docker run ghcr.io/huggingface/text-generation-inference:1.0.0 --help
+```
+
+</Tip>
--- a/docs/source/supported_models.md
+++ b/docs/source/supported_models.md
@ -0,0 +1,40 @@
+# Supported Models and Hardware
+
+Text Generation Inference enables serving optimized models on specific hardware for the highest performance. The following sections list which models are hardware are supported.
+
+## Supported Models
+
+The following models are optimized and can be served with TGI, which uses custom CUDA kernels for better inference. You can add the flag `--disable-custom-kernels` at the end of the `docker run` command if you wish to disable them.
+
+- [BLOOM](https://huggingface.co/bigscience/bloom)
+- [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
+- [Galactica](https://huggingface.co/facebook/galactica-120b)
+- [GPT-Neox](https://huggingface.co/EleutherAI/gpt-neox-20b)
+- [Llama](https://github.com/facebookresearch/llama)
+- [OPT](https://huggingface.co/facebook/opt-66b)
+- [SantaCoder](https://huggingface.co/bigcode/santacoder)
+- [Starcoder](https://huggingface.co/bigcode/starcoder)
+- [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
+- [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
+- [MPT](https://huggingface.co/mosaicml/mpt-30b)
+- [Llama V2](https://huggingface.co/meta-llama)
+
+If the above list lacks the model you would like to serve, depending on the model's pipeline type, you can try to initialize and serve the model anyways to see how well it performs, but performance isn't guaranteed for non-optimized models:
+
+```python
+# for causal LMs/text-generation models
+AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")`
+# or, for text-to-text generation models
+AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
+```
+
+
+## Supported Hardware
+
+TGI optimized models are supported on NVIDIA [A100](https://www.nvidia.com/en-us/data-center/a100/), [A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) and [T4](https://www.nvidia.com/en-us/data-center/tesla-t4/) GPUs with CUDA 11.8+. Note that you have to install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to use it. For other hardware, continuous batching will still apply, but some operations like flash attention and paged attention will not be executed. 
+
+TGI is also supported on the following AI hardware accelerators:
+- *Habana first-gen Gaudi and Gaudi2:* check out this [example](https://github.com/huggingface/optimum-habana/tree/main/text-generation-inference) how to serve models with TGI on Gaudi and Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index)
+
+
+