# Consuming Text Generation Inference
There are many ways you can consume the Text Generation Inference server in your applications. After launching, you can use the `/generate` route and make a `POST` request to get results from the server. You can also use the `/generate_stream` route if you want TGI to return a stream of tokens. You can make the requests using the tool of your preference, such as curl, Python or TypeScript. For a final end-to-end experience, we also open-sourced ChatUI, a chat interface for open-source models.
## curl
After the launch, you can query the model using either the `/generate` or `/generate_stream` routes:
```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
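If you want tokens returned as they are generated, you can hit the `/generate_stream` route in the same way; the response is streamed back as server-sent events instead of a single JSON payload. A minimal sketch, assuming the server is still listening on `127.0.0.1:8080`:
```bash
# Same request body as above, but /generate_stream streams server-sent events,
# emitting tokens as they are generated instead of a single JSON response.
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```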
## Inference Client
[`huggingface-hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a nice high-level class, [`~huggingface_hub.InferenceClient`], which makes it easy to make calls to a TGI endpoint. `InferenceClient` also takes care of parameter validation and provides a simple-to-use interface.
You can simply install the `huggingface-hub` package with pip.
```bash
pip install huggingface-hub
```
Once you start the TGI server, instantiate `InferenceClient()` with the URL to the endpoint serving the model. You can then call `text_generation()` to hit the endpoint through Python.
```python
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://127.0.0.1:8080")
client.text_generation(prompt="Write a code for snake game")
```
You can do streaming with `InferenceClient` by passing `stream=True`. Streaming will return tokens as they are being generated in the server. To use streaming, you can do as follows:
```python
for token in client.text_generation("How do you make cheese?", max_new_tokens=12, stream=True):
    print(token)
```
Another parameter you can use with the TGI backend is `details`. You can get more details on generation (tokens, probabilities, etc.) by setting `details` to `True`. When it's specified, TGI will return a `TextGenerationResponse` or `TextGenerationStreamResponse` rather than a string or stream.
```python
output = client.text_generation(prompt="Meaning of life is", details=True)
print(output)
# TextGenerationResponse(generated_text=' a complex concept that is not always clear to the individual. It is a concept that is not always', details=Details(finish_reason=<FinishReason.Length: 'length'>, generated_tokens=20, seed=None, prefill=[], tokens=[Token(id=267, text=' a', logprob=-2.0723474, special=False), Token(id=11235, text=' complex', logprob=-3.1272552, special=False), Token(id=17908, text=' concept', logprob=-1.3632495, special=False),..))
```
You can see below how to combine `details` with streaming.
```python
output = client.text_generation(prompt="Meaning of life is", stream=True, details=True)
print(next(iter(output)))
# TextGenerationStreamResponse(token=Token(id=267, text=' a', logprob=-2.0723474, special=False), generated_text=None, details=None)
```
You can check out the details of the function [here](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation). There is also an async version of the client, `AsyncInferenceClient`, based on `asyncio` and `aiohttp`. You can find docs for it [here](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.AsyncInferenceClient).
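As a quick sketch of the async client (assuming the same local endpoint on port 8080 as above), the call signatures mirror the synchronous ones, but you `await` the calls and iterate over streams with `async for`:
```python
import asyncio

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient(model="http://127.0.0.1:8080")

async def main():
    # Non-streaming call: awaiting returns the generated text as a string
    text = await client.text_generation(prompt="Meaning of life is", max_new_tokens=20)
    print(text)

    # Streaming call: awaiting returns an async iterator over tokens
    async for token in await client.text_generation(
        "How do you make cheese?", max_new_tokens=12, stream=True
    ):
        print(token)

asyncio.run(main())
```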
## ChatUI
ChatUI is an open-source interface built for LLM serving. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at [Hugging Chat](https://huggingface.co/chat/), or use the [ChatUI Docker Space](https://huggingface.co/new-space?template=huggingchat/chat-ui-template) to deploy your own Hugging Chat to Spaces.
To serve both ChatUI and TGI in the same environment, simply add your own endpoints to the `MODELS` variable in the `.env.local` file inside the `chat-ui` repository. Provide endpoints that point to where TGI is served.
```
{
  // rest of the model config here
  "endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
}
```
![ChatUI](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/chatui_screen.png)
## Gradio
Gradio is a Python library that helps you build web applications for your machine learning models with a few lines of code. It has a `ChatInterface` wrapper that helps create neat UIs for chatbots. Let's take a look at how to create a chatbot with streaming mode using TGI and Gradio. Let's install Gradio and the Hub Python library first.
```bash
pip install huggingface-hub gradio
```
Assuming you are serving your model on port 8080, we will query it through [InferenceClient](consuming_tgi#inference-client).
```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

def inference(message, history):
    partial_message = ""
    for token in client.text_generation(message, max_new_tokens=20, stream=True):
        partial_message += token
        yield partial_message

gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Chat with me!", container=False, scale=7),
    description="This is the demo for Gradio UI consuming TGI endpoint with LLaMA 7B-Chat model.",
    title="Gradio 🤝 TGI",
    examples=["Are tomatoes vegetables?"],
    retry_btn="Retry",
    undo_btn="Undo",
    clear_btn="Clear",
).queue().launch()
```
The UI looks like this 👇
<div class="flex justify-center">
<img
class="block dark:hidden"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/gradio-tgi.png"
/>
<img
class="hidden dark:block"
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/gradio-tgi-dark.png"
/>
</div>
You can try the demo directly here 👇
<div class="block dark:hidden">
<iframe
src="https://merve-gradio-tgi-2.hf.space?__theme=light"
width="850"
height="750"
></iframe>
</div>
<div class="hidden dark:block">
<iframe
src="https://merve-gradio-tgi-2.hf.space?__theme=dark"
width="850"
height="750"
></iframe>
</div>
You can disable streaming mode using `return` instead of `yield` in your inference function, like below.
```python
def inference(message, history):
    return client.text_generation(message, max_new_tokens=20)
```
You can read more about how to customize a `ChatInterface` [here](https://www.gradio.app/guides/creating-a-chatbot-fast).
## API documentation
You can consult the OpenAPI documentation of the `text-generation-inference` REST API using the `/docs` route. The Swagger UI is also available [here](https://huggingface.github.io/text-generation-inference).
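For example, assuming the server is running locally on port 8080 as in the examples above, you can open the interactive documentation in a browser at `http://127.0.0.1:8080/docs`, or check from the command line that the route is up:
```bash
# Returns the Swagger UI page served by the /docs route
curl 127.0.0.1:8080/docs
```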