From aaea212d0f53929cd3775af3eaf06f4af0a868a5 Mon Sep 17 00:00:00 2001
From: Martin Iglesias Goyanes
Date: Fri, 6 Sep 2024 17:00:54 +0200
Subject: [PATCH] Add links to Adyen blogpost (#2500)

* Add links to Adyen blogpost

* Adding to toctree.

* Update external.md

* Update _toctree.yml

---------

Co-authored-by: Nicolas Patry
---
 README.md                           | 2 +-
 docs/source/_toctree.yml            | 2 ++
 docs/source/conceptual/external.md  | 4 ++++
 docs/source/conceptual/streaming.md | 4 ----
 4 files changed, 7 insertions(+), 5 deletions(-)
 create mode 100644 docs/source/conceptual/external.md

diff --git a/README.md b/README.md
index cf6a30db..cc9d523f 100644
--- a/README.md
+++ b/README.md
@@ -189,7 +189,7 @@ overridden with the `--otlp-service-name` argument
 
 ![TGI architecture](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/TGI.png)
 
-Detailed blogpost by Adyen on TGI inner workings: [LLM inference at scale with TGI](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)
+Detailed blogpost by Adyen on TGI inner workings: [LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)
 
 ### Local install

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index f52fa2ec..b883b36d 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -71,6 +71,8 @@
     title: How Guidance Works (via outlines)
   - local: conceptual/lora
     title: LoRA (Low-Rank Adaptation)
+  - local: conceptual/external
+    title: External Resources
   title: Conceptual Guides

diff --git a/docs/source/conceptual/external.md b/docs/source/conceptual/external.md
new file mode 100644
index 00000000..9cbe1b5a
--- /dev/null
+++ b/docs/source/conceptual/external.md
@@ -0,0 +1,4 @@
+# External Resources
+
+- Adyen wrote a detailed article about the interplay between TGI's main components: router and server.
+[LLM inference at scale with TGI (Martin Iglesias Goyanes - Adyen, 2024)](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)

diff --git a/docs/source/conceptual/streaming.md b/docs/source/conceptual/streaming.md
index f1f37f2a..b8154ba4 100644
--- a/docs/source/conceptual/streaming.md
+++ b/docs/source/conceptual/streaming.md
@@ -155,7 +155,3 @@ SSEs are different than:
 * Webhooks: where there is a bi-directional connection. The server can send information to the client, but the client can also send data to the server after the first request. Webhooks are more complex to operate as they don’t only use HTTP.
 
 If there are too many requests at the same time, TGI returns an HTTP Error with an `overloaded` error type (`huggingface_hub` returns `OverloadedError`). This allows the client to manage the overloaded server (e.g., it could display a busy error to the user or retry with a new request). To configure the maximum number of concurrent requests, you can specify `--max_concurrent_requests`, allowing clients to handle backpressure.
-
-## External sources
-
-Adyen wrote a nice recap of how TGI streaming feature works. [LLM inference at scale with TGI](https://www.adyen.com/knowledge-hub/llm-inference-at-scale-with-tgi)