add some more basic info in README.md
# Text Generation Inference - TensorRT-LLM Backend Implementation
## Description
This folder contains the sources of the TensorRT-LLM backend implementation, powered by the new TensorRT-LLM Executor API.
## Simplified Request Sequence
```mermaid
sequenceDiagram
    actor User
    participant TextGenerationInference.HttpServer
    participant TextGenerationInference.TensorRtLlmBackend
    participant TextGenerationInference.TensorRtLlmWorkerThread
    participant TensorRtLlm.Executor
    participant Nvidia.Gpu

    User ->> TextGenerationInference.HttpServer: POST /generate
    TextGenerationInference.HttpServer ->> TextGenerationInference.TensorRtLlmBackend: Validate and forward inputs & parameters
    TextGenerationInference.TensorRtLlmBackend ->> TextGenerationInference.TensorRtLlmWorkerThread: Allocate a new context and spawn a new thread to handle the request
    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Submit the request to the In-Flight Batcher

    activate Nvidia.Gpu
    TensorRtLlm.Executor ->> Nvidia.Gpu: Add the request to the pool for execution
    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Respond with a unique request identifier

    rect rgb(10, 92, 54)
        loop every 100us
            rect rgb(15, 81, 50)
                alt Acquire lock to query executor
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Poll the number of new token(s) generated for the request
                else There are new generated tokens
                    TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Retrieve the newly generated tokens
                    TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Return decoded token information and potential error (omitted)
                    rect rgb(11, 110, 79)
                        alt Generated token is final
                            TensorRtLlm.Executor ->> Nvidia.Gpu: Remove the request from the scheduler and from the GPU
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream the remaining decoded tokens and flush the connection
                        else Generated token is not final
                            TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream tokens back to the user as they are decoded
                        end
                    end
                end
            end
            deactivate Nvidia.Gpu
        end
    end
```