Improve the Consuming TGI + Streaming docs. (#2412)
* Improve the Consuming TGI docs.
* Fix erroneous update to .
* Add info about the OpenAI client.
* More updates.
* Apply suggestions from code review Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
* Suggestions from Lucain.
* Update Gradio snippet.
* Up.
* Apply suggestions from code review Co-authored-by: Lucain <lucainp@gmail.com>
* Update docs/source/basic_tutorials/consuming_tgi.md Co-authored-by: Lucain <lucainp@gmail.com>
* Up.
* Apply suggestions from code review Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Up.
* Up.
* Doc review from Nico.
* Doc review from Nico. x2
* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
@@ -1,81 +1,125 @@

# Consuming Text Generation Inference

There are many ways to consume the Text Generation Inference (TGI) server in your applications. After launching the server, you can use the [Messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) `/v1/chat/completions` route and make a `POST` request to get results from the server. You can also pass `"stream": true` to the call if you want TGI to return a stream of tokens.

For more information on the API, consult the OpenAPI documentation of `text-generation-inference` available [here](https://huggingface.github.io/text-generation-inference).

You can make the requests using any tool of your preference, such as curl, Python, or TypeScript. For an end-to-end experience, we've open-sourced [ChatUI](https://github.com/huggingface/chat-ui), a chat interface for open-access models.

## curl

After a successful server launch, you can query the model using the `v1/chat/completions` route to get responses that comply with the OpenAI Chat Completions spec:

```bash
curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```

For non-chat use-cases, you can also use the `/generate` and `/generate_stream` routes.

```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
  "inputs": "What is Deep Learning?",
  "parameters": {
    "max_new_tokens": 20
  }
}' \
    -H 'Content-Type: application/json'
```

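If you prefer Python to raw curl for these non-chat routes, the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) client introduced in the next section also wraps them through its `text_generation` method. A minimal sketch, assuming the same server is listening on port 8080:

```python
from huggingface_hub import InferenceClient

# Points at the local TGI server used in the curl examples above
client = InferenceClient(base_url="http://127.0.0.1:8080")

# Equivalent of the `/generate` route: returns the generated text as a string
print(client.text_generation("What is Deep Learning?", max_new_tokens=20))

# Equivalent of the `/generate_stream` route: yields tokens as they are generated
for token in client.text_generation("What is Deep Learning?", max_new_tokens=20, stream=True):
    print(token, end="")
```
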
## Python

### Inference Client

[`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/main/en/index) is a Python library to interact with the Hugging Face Hub, including its endpoints. It provides a high-level class, [`huggingface_hub.InferenceClient`](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient), which makes it easy to make calls to TGI's Messages API. `InferenceClient` also takes care of parameter validation and provides a simple-to-use interface.

Install the `huggingface_hub` package via pip.

```bash
pip install huggingface_hub
```

You can now use `InferenceClient` the exact same way you would use the `OpenAI` client in Python:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url="http://localhost:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
```

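If you don't need token-by-token output, a variation of the request above (same `client`) simply omits `stream=True`; the call then returns the complete response and the generated text can be read from the first choice. A minimal sketch:

```python
# Non-streaming variant of the request above
output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    max_tokens=1024,
)
print(output.choices[0].message.content)
```
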
You can check out more details about OpenAI compatibility [here](https://huggingface.co/docs/huggingface_hub/en/guides/inference#openai-compatibility).

There is also an async version of the client, `AsyncInferenceClient`, based on `asyncio` and `aiohttp`. You can find its documentation [here](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.AsyncInferenceClient).

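A minimal sketch of the async client, assuming the same local endpoint as above:

```python
import asyncio

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient(base_url="http://localhost:8080/v1/")

async def main():
    # Same Messages API call as above, awaited and iterated asynchronously
    stream = await client.chat.completions.create(
        model="tgi",
        messages=[{"role": "user", "content": "Count to 10"}],
        stream=True,
        max_tokens=1024,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())
```
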
### OpenAI Client

You can directly use the OpenAI [Python](https://github.com/openai/openai-python) or [JS](https://github.com/openai/openai-node) clients to interact with TGI.

Install the OpenAI Python package via pip.

```bash
pip install openai
```

```python
from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)
```

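Each streamed `message` above is a chat completion chunk object; if you only want the generated text, a small variation of the loop reads the delta content instead (empty deltas are skipped via `or ""`):

```python
# Print only the newly generated text from each streamed chunk
for message in chat_completion:
    print(message.choices[0].delta.content or "", end="")
```
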
## UI

### Gradio

Gradio is a Python library that helps you build web applications for your machine learning models with a few lines of code. It has a `ChatInterface` wrapper that helps create neat UIs for chatbots. Let's take a look at how to create a chatbot with streaming mode using TGI and Gradio. Let's install Gradio and the Hub Python library first.

@@ -89,19 +133,28 @@ Assume you are serving your model on port 8080, we will query through [Inference

```python
import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")

def inference(message, history):
    partial_message = ""
    output = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": message},
        ],
        stream=True,
        max_tokens=1024,
    )
    for chunk in output:
        partial_message += chunk.choices[0].delta.content
        yield partial_message

gr.ChatInterface(
    inference,
    chatbot=gr.Chatbot(height=300),
    textbox=gr.Textbox(placeholder="Chat with me!", container=False, scale=7),
    description="This is the demo for Gradio UI consuming TGI endpoint.",
    title="Gradio 🤝 TGI",
    examples=["Are tomatoes vegetables?"],
    retry_btn="Retry",
@@ -110,20 +163,7 @@ gr.ChatInterface(
).queue().launch()
```

You can check out the UI and try the demo directly here 👇

<div class="block dark:hidden">
	<iframe
@@ -141,15 +181,19 @@ You can try the demo directly here 👇
</div>

You can read more about how to customize a `ChatInterface` [here](https://www.gradio.app/guides/creating-a-chatbot-fast).

### ChatUI

[ChatUI](https://github.com/huggingface/chat-ui) is an open-source interface built for consuming LLMs. It offers many customization options, such as web search with SERP API and more. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at [Hugging Chat](https://huggingface.co/chat/), or use the [ChatUI Docker Space](https://huggingface.co/new-space?template=huggingchat/chat-ui-template) to deploy your own Hugging Chat to Spaces.

To serve both ChatUI and TGI in the same environment, simply add your own endpoints to the `MODELS` variable in the `.env.local` file inside the `chat-ui` repository. Provide the endpoints pointing to where TGI is served.

```
{
// rest of the model config here
"endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
}
```

![ChatUI](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/chatui_screen.png)

@@ -4,7 +4,7 @@ Text Generation Inference (TGI) now supports [JSON and regex grammars](#grammar-

These features are available starting from version `1.4.3`. They are accessible via the [`huggingface_hub`](https://pypi.org/project/huggingface-hub/) library. The tool support is compatible with OpenAI's client libraries. The following guide will walk you through the new features and how to use them!

_note: guidance is supported as grammar in the `/generate` endpoint and as tools in the `v1/chat/completions` endpoint._

## How it works

@@ -48,34 +48,29 @@ To stream tokens with `InferenceClient`, simply pass `stream=True` and iterate o

```python
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")
output = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

# 1
# 2
# 3
# 4
# 5
# 6
# 7
# 8
# 9
# 10
```

The `huggingface_hub` library also comes with an `AsyncInferenceClient` in case you need to handle the requests concurrently.

@@ -83,31 +78,46 @@ The `huggingface_hub` library also comes with an `AsyncInferenceClient` in case

```python
import asyncio

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient(base_url="http://127.0.0.1:8080")

async def main():
    stream = await client.chat.completions.create(
        messages=[{"role": "user", "content": "Say this is a test"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())

# This
# is
# a
# test
#.
```

### Streaming with cURL

To use the OpenAI Chat Completions compatible Messages API `v1/chat/completions` endpoint with curl, you can add the `-N` flag, which disables curl's default buffering and shows data as it arrives from the server.

```curl
curl -N localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```
