Funtowicz Morgan ba5fc7d922 Add support for stop words in TRTLLM (#2678)	2024-10-25 10:58:34 +02:00
* feat(trtllm): rewrite health to not account for current state
* chore(looper): cleanup a bit more
* feat(post_processing): max_new_tokens is const evaluated now
* chore(ffi): formatting
* feat(trtllm): add stop words handling
# Conflicts:
#	backends/trtllm/lib/backend.cpp
* chore(trtllm): create specific parallelconfig factory and logging init methods
* chore(trtllm): define a macro for SizeType cast
* chore(trtllm): use GetParallelConfig
* chore(trtllm): minor refactoring
* chore(trtllm): validate there are enough GPus on the system for the desired model
* chore(trtllm): ensure max throughput scheduling policy is selected
* chore(trtllm): minor fix
* chore(router): minor refactorings
* feat(docker): build with-slurm ompi
* feat(docker): add python3.10 dev to runtime deps
* chore(docker): add mpi to ld_library_path
* chore(docker): install transformers
* feat(trtllm): detect stop_words from generation_config.json
cmake	[TENSORRT-LLM] - Implement new looper thread based backend (#2357 )	2024-10-25 07:17:14 +02:00
include	Add support for stop words in TRTLLM (#2678 )	2024-10-25 10:58:34 +02:00
lib	Add support for stop words in TRTLLM (#2678 )	2024-10-25 10:58:34 +02:00
scripts	[TENSORRT-LLM] - Implement new looper thread based backend (#2357 )	2024-10-25 07:17:14 +02:00
src	Add support for stop words in TRTLLM (#2678 )	2024-10-25 10:58:34 +02:00
tests	Rebase TRT-llm (#2331 )	2024-07-31 10:33:10 +02:00
CMakeLists.txt	[TENSORRT-LLM] - Implement new looper thread based backend (#2357 )	2024-10-25 07:17:14 +02:00
Cargo.toml	[TENSORRT-LLM] - Implement new looper thread based backend (#2357 )	2024-10-25 07:17:14 +02:00
README.md	Rebase TRT-llm (#2331 )	2024-07-31 10:33:10 +02:00
build.rs	[TENSORRT-LLM] - Implement new looper thread based backend (#2357 )	2024-10-25 07:17:14 +02:00

README.md

Text Generation Inference - TensorRT-LLM Backend Implementation

Description

This folder contains the sources of the TensorRT-LLM backend implementation, which is built on the new TensorRT-LLM Executor API.

Simplified Request Sequence

sequenceDiagram
    actor User
    participant TextGenerationInference.HttpServer
    participant TextGenerationInference.TensorRtLlmBackend
    participant TextGenerationInference.TensorRtLlmWorkerThread
    participant TensorRtLlm.Executor
    participant Nvidia.Gpu
    User ->> TextGenerationInference.HttpServer: POST /generate
    TextGenerationInference.HttpServer ->> TextGenerationInference.TensorRtLlmBackend: Validate and forward inputs & parameters
    TextGenerationInference.TensorRtLlmBackend ->> TextGenerationInference.TensorRtLlmWorkerThread: Allocate a new context and spawn a new thread to handle the request
    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Submit the request to the In-Flight Batcher
    activate Nvidia.Gpu
    TensorRtLlm.Executor ->> Nvidia.Gpu: Add the request to the pool for execution
    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Respond with a unique request identifier
    rect rgb(10, 92, 54)
        loop every 100us
            rect rgb(15, 81, 50)
                alt Acquire lock to query executor
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Poll the number of newly generated tokens for the request
                else There are new generated tokens
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Retrieve newly generated tokens
                    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Return decoded token information and potential error (omitted)
                    rect rgb(11, 110, 79)
                        alt Generated token is final
                            TensorRtLlm.Executor ->> Nvidia.Gpu: Remove request from the scheduler and from the GPU
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream the remaining decoded tokens and flush the connection
                        else Generated token is not final
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream tokens back to the user as they get decoded
                        end
                    end
                end
            end
            deactivate Nvidia.Gpu
        end
    end
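
Worker Loop Sketch

The polling loop in the diagram above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the backend's actual code: the `Executor` trait, `ScriptedExecutor`, `DecodedToken`, `run_request` and `stream_tokens` are all hypothetical names invented for this example, and the toy executor only replays a fixed list of token ids instead of talking to TensorRT-LLM or a GPU. It shows the three steps of the sequence: submit the request, poll under a lock every 100us, and stream tokens until the final one arrives.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// One decoded step (hypothetical shape, not the real backend type).
#[derive(Debug, Clone, PartialEq)]
pub struct DecodedToken {
    pub id: u32,
    pub is_final: bool,
}

/// Hypothetical stand-in for the executor; this is NOT the TensorRT-LLM API.
pub trait Executor: Send {
    /// Enqueue a request and return its unique identifier.
    fn submit(&mut self, prompt: &[u32]) -> u64;
    /// Number of tokens currently ready to be pulled for the request.
    fn num_ready(&self, request_id: u64) -> usize;
    /// Remove and return the tokens that are ready for the request.
    fn pull(&mut self, request_id: u64) -> Vec<DecodedToken>;
}

/// Toy executor that replays a fixed script, one token per poll.
pub struct ScriptedExecutor {
    pending: VecDeque<DecodedToken>,
}

impl ScriptedExecutor {
    pub fn new(ids: &[u32]) -> Self {
        // An empty script would never produce a final token.
        assert!(!ids.is_empty(), "script must contain at least one token");
        let last = ids.len() - 1;
        let pending = ids
            .iter()
            .enumerate()
            .map(|(i, &id)| DecodedToken { id, is_final: i == last })
            .collect();
        Self { pending }
    }
}

impl Executor for ScriptedExecutor {
    fn submit(&mut self, _prompt: &[u32]) -> u64 {
        1 // a single request is enough for this sketch
    }
    fn num_ready(&self, _request_id: u64) -> usize {
        self.pending.len().min(1)
    }
    fn pull(&mut self, _request_id: u64) -> Vec<DecodedToken> {
        self.pending.pop_front().into_iter().collect()
    }
}

/// Worker loop: submit the request, then poll every 100us until the
/// final token has been streamed or the receiver has gone away.
pub fn run_request<E: Executor>(
    executor: Arc<Mutex<E>>,
    prompt: Vec<u32>,
    out: Sender<DecodedToken>,
) {
    let request_id = executor.lock().unwrap().submit(&prompt);
    loop {
        // Hold the lock only while querying the executor.
        let tokens = {
            let mut guard = executor.lock().unwrap();
            if guard.num_ready(request_id) > 0 {
                guard.pull(request_id)
            } else {
                Vec::new()
            }
        };
        for token in tokens {
            let is_final = token.is_final;
            if out.send(token).is_err() || is_final {
                return; // dropping `out` closes the stream
            }
        }
        thread::sleep(Duration::from_micros(100));
    }
}

/// Drive one request on its own thread and collect the streamed ids.
pub fn stream_tokens(script: &[u32]) -> Vec<u32> {
    let executor = Arc::new(Mutex::new(ScriptedExecutor::new(script)));
    let (tx, rx) = channel();
    let worker = thread::spawn(move || run_request(executor, vec![0], tx));
    let streamed: Vec<u32> = rx.iter().map(|t| t.id).collect();
    worker.join().unwrap();
    streamed
}
```

Calling `stream_tokens(&[11, 22, 33])` spawns one worker thread, which forwards each token over the channel as soon as the toy executor reports it ready. The channel stands in for the HTTP stream back to the user: the receiver loop ends once the worker returns and drops its sender, which mirrors flushing the connection after the final token.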