hf_text-generation-inference/docs/source/conceptual/streaming.md

# Streaming


## What is Streaming?

Token streaming is the mode in which the server returns tokens one by one as the model generates them. This lets you show the generation to the user progressively instead of waiting for the whole response. Streaming is an essential part of the end-user experience because it reduces perceived latency, one of the most critical aspects of a smooth experience.

<div class="flex justify-center">
    <img
        class="block dark:hidden"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual_360.gif"
    />
    <img
        class="hidden dark:block"
        src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/streaming-generation-visual-dark_360.gif"
    />
</div>

With token streaming, the server can start returning tokens before the whole response has been generated, so users get a sense of the generation's quality before it finishes. This has several positive effects:

* Users can get results orders of magnitude earlier for extremely long queries.
* Seeing something in progress allows users to stop the generation if it's not going in the direction they expect.
* Perceived latency is lower when results are shown in the early stages.
* When used in conversational UIs, the experience feels more natural.

For example, suppose a system generates 100 tokens per second. If the response is 1000 tokens long, users of the non-streaming setup need to wait 10 seconds before they see anything. With the streaming setup, users get initial results immediately, and although the end-to-end latency is the same, they can see half of the generation after five seconds. Below you can see an interactive demo that shows non-streaming vs streaming side-by-side. Click **generate** below.

<div class="block dark:hidden">
	<iframe
        src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=light"
        width="850"
        height="350"
    ></iframe>
</div>
<div class="hidden dark:block">
    <iframe
        src="https://osanseviero-streaming-vs-non-streaming.hf.space?__theme=dark"
        width="850"
        height="350"
    ></iframe>
</div>
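
The arithmetic from the example above can be reproduced with a few lines of Python. This is only a sketch: the helper names are illustrative, and it assumes a constant generation rate while ignoring prompt processing and network overhead.

```python
def time_to_see(tokens_seen: int, tokens_per_second: float) -> float:
    """Seconds until `tokens_seen` tokens have been displayed when streaming."""
    return tokens_seen / tokens_per_second


def non_streaming_wait(total_tokens: int, tokens_per_second: float) -> float:
    """Seconds until anything is displayed when the response arrives all at once."""
    return total_tokens / tokens_per_second


rate = 100    # tokens per second
total = 1000  # tokens in the full response

print(non_streaming_wait(total, rate))  # 10.0
print(time_to_see(1, rate))             # 0.01
print(time_to_see(total // 2, rate))    # 5.0
print(time_to_see(total, rate))         # 10.0
```

The last line shows that the total time is unchanged; streaming only moves the first visible output much earlier.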

## How to use Streaming?

### Streaming with Python

To stream tokens with `InferenceClient`, simply pass `stream=True` and iterate over the response.

```python
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")
output = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

# 1
# 2
# 3
# 4
# 5
# 6
# 7
# 8
# 9
# 10
```
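
If you also need the full text once the stream ends, you can accumulate the deltas while iterating. The sketch below assumes each chunk has the `choices[0].delta.content` layout used above and that `content` may be empty or `None` for some chunks. `iter_text` and `fake_chunk` are hypothetical helpers, and the fake stream stands in for a real response so the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the text of each streamed chunk, skipping empty deltas."""
    for chunk in chunks:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            yield content


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk (same attribute layout)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Deep"), fake_chunk(" learning"), fake_chunk(None)]
full_text = "".join(iter_text(fake_stream))
print(full_text)  # Deep learning
```

With a real client, you would pass the object returned by `client.chat.completions.create(..., stream=True)` to `iter_text` instead of `fake_stream`.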

The `huggingface_hub` library also comes with an `AsyncInferenceClient` in case you need to handle the requests concurrently.

```python
import asyncio

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient(base_url="http://127.0.0.1:8080")


async def main():
    stream = await client.chat.completions.create(
        messages=[{"role": "user", "content": "Say this is a test"}],
        stream=True,
    )
    async for chunk in stream:
        # delta.content can be None, so fall back to an empty string
        print(chunk.choices[0].delta.content or "", end="")


asyncio.run(main())

# This is a test.
```

### Streaming with cURL

To use the OpenAI Chat Completions compatible Messages API `v1/chat/completions` endpoint with curl, add the `-N` flag, which disables curl's default output buffering and shows data as it arrives from the server.

```curl
curl -N localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
```

### Streaming with JavaScript

First, we need to install the `@huggingface/inference` library.
`npm install @huggingface/inference`

If you're using the free Inference API, you can use `HfInference`. If you're using inference endpoints, you can use `HfInferenceEndpoint`.

We can create an `HfInferenceEndpoint` by providing our endpoint URL and credentials.

```js
import { HfInferenceEndpoint } from '@huggingface/inference'

const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.cloud', 'hf_YOUR_TOKEN')

// prompt
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'

const stream = hf.textGenerationStream({ inputs: prompt })
for await (const r of stream) {
  // yield the generated token
  process.stdout.write(r.token.text)
}
```

## How does Streaming work under the hood?

Under the hood, TGI uses Server-Sent Events (SSE). In an SSE setup, the client sends a request with the data, which opens an HTTP connection and subscribes the client to updates. The server then sends data to the client over that connection; no further requests are needed. SSE is unidirectional, meaning the client does not send anything else to the server over that connection. Because SSE runs over plain HTTP, it is easy to use.
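
To make the wire format more concrete, here is a minimal sketch of how a client could decode such a stream by hand. It assumes one JSON payload per `data:` line and an OpenAI-style `data: [DONE]` line at the end of the stream; both are assumptions about the payload rather than guarantees, and the sample lines are made up for illustration. `iter_sse_data` is a hypothetical helper, not part of any library.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)


# Made-up sample of what the server might send, one event per `data:` line.
sample = [
    'data: {"choices": [{"delta": {"content": "Deep"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " learning"}}]}',
    "",
    "data: [DONE]",
]
events = list(iter_sse_data(sample))
text = "".join(e["choices"][0]["delta"]["content"] for e in events)
print(text)  # Deep learning
```

In practice, client libraries such as `huggingface_hub` do this parsing for you. The sketch also ignores parts of the SSE format, such as events whose data spans several `data:` lines.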

SSE differs from:
* Polling: where the client keeps calling the server to get data. The server might return empty responses, which causes overhead.
* WebSockets: where there is a bi-directional connection. The server can send information to the client, but the client can also send data to the server after the first request. WebSockets are more complex to operate as they don’t only use HTTP.

If there are too many requests at the same time, TGI returns an HTTP error with an `overloaded` error type (`huggingface_hub` raises `OverloadedError`). This lets the client react to the overloaded server (e.g., it could display a busy message to the user or retry with a new request). To configure the maximum number of concurrent requests, specify `--max-concurrent-requests`; rejecting requests beyond that limit is what lets clients handle backpressure.
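
One way a client could react to an overloaded server is to retry with exponential backoff. The sketch below is generic rather than a built-in feature of TGI or `huggingface_hub`: `call_with_retry` and `ServerOverloaded` are hypothetical names, and in real code you would pass the overloaded exception class raised by your client as `overloaded_error`.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


class ServerOverloaded(Exception):
    """Hypothetical stand-in for the client's overloaded error."""


def call_with_retry(
    request: Callable[[], T],
    overloaded_error: Type[Exception] = ServerOverloaded,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying with exponential backoff while overloaded."""
    for attempt in range(max_attempts):
        try:
            return request()
        except overloaded_error:
            if attempt == max_attempts - 1:
                raise  # give up: let the caller show a busy message
            sleep(base_delay * 2**attempt)
    raise ValueError("max_attempts must be at least 1")


# Simulated request that is rejected twice before succeeding.
attempts = []


def flaky_request() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise ServerOverloaded("overloaded")
    return "ok"


delays = []
result = call_with_retry(flaky_request, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Passing `sleep=delays.append` records the waits instead of sleeping, which keeps the example instant. The default `time.sleep` is what you would use against a real server.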