
# Router

Also called the webserver throughout the docs.

This router handles most of the batching logic: deciding when to run new prefill requests, when to pause decode requests, which requests go into which batch, and so on.

It uses gRPC to communicate with the shards, which can therefore be kept much simpler and focus on making the forward passes as efficient as possible.
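
As a rough illustration, the split means the router only ever needs two operations from a shard. The sketch below is hypothetical Rust (the `Batch`, `Generation`, and `ShardClient` names are made up for illustration); the real interface is defined in the gRPC proto files, is asynchronous, and returns much richer information (logprobs, timings, cache metadata, ...).

```rust
/// Hypothetical, simplified view of what the router asks of a shard.
pub struct Batch {
    pub id: u64,
    pub request_ids: Vec<u64>,
}

pub struct Generation {
    pub request_id: u64,
    pub token_text: String,
    pub finished: bool,
}

pub trait ShardClient {
    /// Run the first (prefill) pass for a batch of new requests.
    fn prefill(&mut self, batch: Batch) -> Vec<Generation>;
    /// Run one decode step for all requests currently in flight.
    fn decode(&mut self, request_ids: Vec<u64>) -> Vec<Generation>;
}
```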

## Continuous batching

One important feature of text-generation-inference is enabled by this router.

Continuous batching is the act of regularly adding new queries to the same forward step of the LLM (a "batch") and removing them as they finish.

In order for continuous batching to be useful, you need to have more compute available relative to the memory requirements of your model. This is essentially always true for LLMs, and the larger the model, the truer it gets (since you have to pool multiple GPUs just to load the model, you effectively have a lot of compute power at hand).

Static batching is the act of doing several queries at the same time, but usually this is controlled by the client, and therefore the amount of batching is decided beforehand.

For text generation with LLMs, which are memory bound, we can be much more efficient with the available compute by having clients send us single queries and letting the router mix & match queries into and out of batches to use the compute as efficiently as possible. This is possible because, for LLMs, the total compute needed to run the model is much larger than the overhead of mixing and matching the batches themselves.

### Simple continuous batching

text-generation works by feeding a prompt to a model, and iteratively calling forward on the model to produce new text, 1 token at a time.

The first idea is simple: when a query arrives, we start working on it directly. When new queries arrive, we wait for the current forward pass to finish, batch the currently running prompt with the new query, and call forward again.

Whenever a query finishes, either because the model produced an EOS (end of sequence) token or because the query reached its allowed limit, we simply drop it from the batch, free all the memory it allocated, and continue with the rest until nothing is left.

This simple idea generalizes very well and we could potentially stack many requests in the same batch.
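
Here is a minimal sketch of that loop, reusing the hypothetical `ShardClient`, `Batch`, and `Generation` types from the earlier sketch (the real router is asynchronous, keeps a waiting queue with budgets, and handles errors):

```rust
/// Hypothetical driver loop: prefill newcomers, decode everyone, drop the
/// requests that finished, repeat until nothing is left.
fn batching_loop(client: &mut impl ShardClient, incoming: &mut Vec<Batch>) {
    let mut in_flight: Vec<u64> = Vec::new();

    loop {
        // New queries join the decode group through a prefill pass.
        while let Some(batch) = incoming.pop() {
            for generation in client.prefill(batch) {
                if !generation.finished {
                    in_flight.push(generation.request_id);
                }
            }
        }

        if in_flight.is_empty() {
            break;
        }

        // One forward step for every running request, then drop the ones
        // that produced EOS or hit their token limit.
        let generations = client.decode(in_flight.clone());
        in_flight.retain(|id| {
            generations
                .iter()
                .find(|g| g.request_id == *id)
                .map_or(false, |g| !g.finished)
        });
    }
}
```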

One thing to note is that queries can be run with different parameters, meaning different ways to choose the next token (sampling or not, temperature, top_k, etc.). This is not a problem for the proposed approach; we just need to do the sampling independently for each member of the batch.
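
For example, next-token selection can be pictured as a per-request function of the logits and that request's own parameters. The `NextTokenParams` struct below is made up for illustration, and the sampling branch is only hinted at:

```rust
/// Hypothetical per-request generation parameters (the real ones include
/// top_p, repetition penalty, seeds, stopping criteria, ...).
struct NextTokenParams {
    do_sample: bool,
    temperature: f32,
    top_k: usize,
}

/// Pick the next token id for one batch member, independently of the others.
fn next_token(logits: &[f32], params: &NextTokenParams) -> usize {
    // Temperature only matters on the sampling path.
    let scaled: Vec<f32> = if params.do_sample && params.temperature > 0.0 {
        logits.iter().map(|l| l / params.temperature).collect()
    } else {
        logits.to_vec()
    };
    // Greedy argmax shown here; a real sampler would draw from the
    // top_k-filtered distribution instead.
    let _ = params.top_k;
    scaled
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// The whole batch is just the same function applied member by member.
fn next_tokens(batch_logits: &[Vec<f32>], batch_params: &[NextTokenParams]) -> Vec<usize> {
    batch_logits
        .iter()
        .zip(batch_params)
        .map(|(logits, params)| next_token(logits, params))
        .collect()
}
```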

### Prefill, decode and past key values

In order to make LLMs and text generation efficient, there is a very powerful trick that can be used: "caching" some of the attention matrices. More on that in the first part of this blog post.

What this means is that the first "pass" over a prompt is different from the subsequent "forward" passes: for the first one we have to compute the entire attention matrix, whereas the follow-ups only require computing the attention for the new token. The first pass is called prefill throughout this codebase, whereas the follow-ups are called decode.
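
A toy sketch of the difference, with a hypothetical per-request cache (real engines keep these tensors on the GPU, per layer and attention head): prefill computes and stores an entry for every prompt token, while decode only appends one entry and attends against what is already cached.

```rust
/// Hypothetical key/value cache for a single request.
struct KvCache {
    keys: Vec<Vec<f32>>,   // one entry per past token
    values: Vec<Vec<f32>>, // same length as `keys`
}

impl KvCache {
    /// Prefill: attention is computed over the whole prompt (roughly O(n^2)
    /// work), and the key/value of every prompt token is stored.
    fn prefill(&mut self, prompt_kv: Vec<(Vec<f32>, Vec<f32>)>) {
        for (k, v) in prompt_kv {
            self.keys.push(k);
            self.values.push(v);
        }
    }

    /// Decode: only the new token's key/value is computed and appended
    /// (roughly O(n) work), attending against the cached entries.
    fn decode(&mut self, new_key: Vec<f32>, new_value: Vec<f32>) {
        self.keys.push(new_key);
        self.values.push(new_value);
    }
}
```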

Since prefill is much more expensive than decode, we don't want to run it all the time; a currently running query is most likely doing decode. If we want to do continuous batching as explained previously, we need to run prefill at some point in order to create the attention matrix required for a request to be able to join the decode group.

text-generation-inference uses a bunch of different strategies and parameters in order to enable you to find the sweet spot between exploiting the hardware and perceived latency.

With no continuous batching at all, latency is going to be super good, but throughput (meaning the total number of requests allowed in a given timeframe) is going to be super bad (since it's essentially 1).

With static batching, you can probably reach the maximum throughput (by using the maximum total batch size applicable to your hardware), but the latency is super bad since in order to have maximum throughput you need to wait for requests to come in before processing.

With continuous batching you can find a sweet spot. In general latency is the most critical parameter users care about. But a 2x latency slowdown for 10x more users on the same hardware is an acceptable tradeoff.
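
As a back-of-the-envelope illustration with made-up numbers: if a lone request decodes at 50 ms per token, and batching 10 requests together slows each step down to 100 ms per token, every user sees a 2x latency hit while the server produces 5x more tokens overall.

```rust
// Made-up timings, purely to illustrate the latency/throughput arithmetic.
fn main() {
    let solo_ms_per_token = 50.0_f64;     // one request alone on the hardware
    let batched_ms_per_token = 100.0_f64; // per step with 10 requests batched
    let batch_size = 10.0_f64;

    let solo_tokens_per_s = 1000.0 / solo_ms_per_token;                    // 20 tokens/s
    let batched_tokens_per_s = batch_size * 1000.0 / batched_ms_per_token; // 100 tokens/s

    println!(
        "per-user latency x{:.0}, total throughput x{:.0}",
        batched_ms_per_token / solo_ms_per_token, // 2x slower for each user
        batched_tokens_per_s / solo_tokens_per_s  // 5x more tokens served
    );
}
```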

## Token streaming

This is a very important aspect of client UX. As mentioned above, latency is the most critical perceived quality of an LLM API.

With token streaming, the server can start answering right after the first prefill pass, without waiting for the whole generation to be done. For extremely long queries this means clients can start to see something happening orders of magnitude before the work is finished. Seeing something in progress allows them to cut the generation short if it's not what they want, and it also simply "feels" better.
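
A minimal sketch of the idea, with a channel and a thread standing in for the real async HTTP/SSE machinery: the generating side pushes each token as soon as a decode step produces it, and the consuming side can display tokens long before the full answer exists.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // "Server" side: emit each token right after its decode step.
    thread::spawn(move || {
        for token in ["Once", " upon", " a", " time"] {
            tx.send(token.to_string()).unwrap();
        }
        // Dropping `tx` ends the stream.
    });

    // "Client" side: tokens show up incrementally instead of all at once.
    for token in rx {
        print!("{token}");
    }
    println!();
}
```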