Rebase TRT-llm (#2331)
* wip
wip
refacto
refacto
Initial setup for CXX binding to TRTLLM
Working FFI call for TGI and TRTLLM backend
Remove unused parameters and force tokenizer name to be set
Overall build TRTLLM and deps through CMake build system
Enable end to end CMake build
First version loading engines and making it ready for inference
Remembering to check how we can detect support for chunked context
Move to latest TensorRT-LLM version
Specify which default log level to use depending on CMake build type
make leader executor mode working
unconditionally call InitializeBackend on the FFI layer
bind to CUDA::nvml to retrieve compute capabilities at runtime
updated logic and comment to detect cuda compute capabilities
implement the Stream method to send new tokens through a callback
use spdlog release 1.14.1 moving forward
update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
correctly tell cmake to build dependent tensorrt-llm required libraries
create cmake install target to put everything relevant in installation folder
add auth_token CLI argument to provide hf hub authentication token
allow converting huggingface::tokenizers error to TensorRtLlmBackendError
use correct include for spdlog
include guard to build example in cmakelists
working setup of the ffi layer
remove fmt import
use external fmt lib
end to end ffi flow working
make sure to track include/ffi.h to trigger rebuild from cargo
impl the rust backend which currently cannot move the actual computation in background thread
expose shutdown function at ffi layer
impl RwLock scenario for TensorRtLlmBackend
oops missing c++ backend definitions
compute the number of maximum new tokens for each request independently
make sure the context is not dropped in the middle of the async decoding.
remove unnecessary log
add all the necessary plumbery to return the generated content
update invalid doc in cpp file
correctly forward back the log probabilities
remove unneeded scope variable for now
refactor Stream impl for Generation to factorise code
expose the internal missing start/queue timestamp
forward tgi parameters rep/freq penalty
add some more validation about grammar not supported
define a shared struct to hold the result of a decoding step
expose information about potential error happening while decoding
remove logging
add logging in case of decoding error
make sure executor_worker is provided
add initial Dockerfile for TRTLLM backend
add some more information in CMakeLists.txt to correctly install executorWorker
add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
simplify prebuilt trtllm libraries name definition
do the same name definition stuff for tensorrt_llm_executor_static
leverage pkg-config to probe libraries paths and reuse new install structure from cmake
fix bad copy/paste missing nvinfer linkage direction
align all the linker search dependency
add missing pkgconfig folder for MPI in Dockerfile
correctly setup linking search path for runtime layer
fix missing / before tgi lib path
adding missing ld_library_path for cuda stubs in Dockerfile
update tgi entrypoint
commenting out Python part for TensorRT installation
refactored docker image
move to TensorRT-LLM v0.11.0
make docker linter happy with same capitalization rule
fix typo
refactor the compute capabilities detection along with num gpus
update TensorRT-LLM to latest version
update TensorRT install script to latest
update build.rs to link to cuda 12.5
add missing dependant libraries for linking
clean up a bit
install to decoder_attention target
add some custom stuff for nccl linkage
fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
use std::env::const::ARCH
make sure variable live long enough...
look for cuda 12.5
add some more basic info in README.md
* Rebase.
* Fix autodocs.
* Let's try to enable trtllm backend.
* Ignore backends/v3 by default.
* Fixing client.
* Fix makefile + autodocs.
* Updating the schema thing + redocly.
* Fix trtllm lint.
* Adding pb files ?
* Remove cargo fmt temporarily.
* ?
* Tmp.
* Remove both check + clippy ?
* Backporting telemetry.
* Backporting 457fb0a1
* Remove PB from git.
* Fixing PB with default member backends/client
* update TensorRT-LLM to latest version
* provided None for api_key
* link against libtensorrt_llm and not libtensorrt-llm
---------
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
This commit is contained in:
parent
53aec27328
commit
2b19d671b4
|
@ -2,3 +2,5 @@ aml
|
|||
target
|
||||
server/transformers
|
||||
server/flash-attention
|
||||
cmake-build-debug/
|
||||
cmake-build-release/
|
||||
|
|
|
@ -28,7 +28,7 @@ jobs:
|
|||
|
||||
- name: Install router
|
||||
id: install-router
|
||||
run: cargo install --path router/
|
||||
run: cargo install --path backends/v3/
|
||||
|
||||
- uses: actions/setup-node@v4
|
||||
with:
|
||||
|
@ -41,5 +41,5 @@ jobs:
|
|||
|
||||
- name: Check that documentation is up-to-date
|
||||
run: |
|
||||
npm install -g swagger-cli
|
||||
npm install -g @redocly/cli
|
||||
python update_doc.py --check
|
||||
|
|
|
@ -3,6 +3,10 @@ target
|
|||
router/tokenizer.json
|
||||
*__pycache__*
|
||||
|
||||
backends/v3/src/client/pb
|
||||
backends/client/src/v2/pb
|
||||
backends/client/src/v3/pb
|
||||
|
||||
# ROCm auto-generated files
|
||||
*.hip
|
||||
server/exllamav2_kernels/exllamav2_kernels/hip/
|
||||
|
|
|
@ -13,8 +13,8 @@ repos:
|
|||
- repo: https://github.com/doublify/pre-commit-rust
|
||||
rev: v1.0
|
||||
hooks:
|
||||
- id: fmt
|
||||
- id: cargo-check
|
||||
- id: fmt
|
||||
- id: clippy
|
||||
- repo: https://github.com/astral-sh/ruff-pre-commit
|
||||
rev: v0.3.0
|
||||
|
|
|
@ -0,0 +1,79 @@
|
|||
# This file instructs Redocly's linter to ignore the rules contained for specific parts of your API.
|
||||
# See https://redoc.ly/docs/cli/ for more information.
|
||||
docs/openapi.json:
|
||||
no-empty-servers:
|
||||
- '#/openapi'
|
||||
spec:
|
||||
- >-
|
||||
#/components/schemas/GenerateParameters/properties/best_of/exclusiveMinimum
|
||||
- >-
|
||||
#/components/schemas/GenerateParameters/properties/frequency_penalty/exclusiveMinimum
|
||||
- '#/components/schemas/GenerateParameters/properties/grammar/nullable'
|
||||
- >-
|
||||
#/components/schemas/GenerateParameters/properties/repetition_penalty/exclusiveMinimum
|
||||
- '#/components/schemas/GenerateParameters/properties/seed/exclusiveMinimum'
|
||||
- >-
|
||||
#/components/schemas/GenerateParameters/properties/temperature/exclusiveMinimum
|
||||
- '#/components/schemas/GenerateParameters/properties/top_k/exclusiveMinimum'
|
||||
- >-
|
||||
#/components/schemas/GenerateParameters/properties/top_n_tokens/exclusiveMinimum
|
||||
- '#/components/schemas/GenerateParameters/properties/top_p/exclusiveMinimum'
|
||||
- >-
|
||||
#/components/schemas/GenerateParameters/properties/typical_p/exclusiveMinimum
|
||||
- '#/components/schemas/GenerateResponse/properties/details/nullable'
|
||||
- '#/components/schemas/StreamResponse/properties/details/nullable'
|
||||
- '#/components/schemas/ChatRequest/properties/response_format/nullable'
|
||||
- '#/components/schemas/ChatRequest/properties/tool_choice/nullable'
|
||||
- '#/components/schemas/ToolChoice/nullable'
|
||||
- '#/components/schemas/ChatCompletionComplete/properties/logprobs/nullable'
|
||||
- '#/components/schemas/ChatCompletionChoice/properties/logprobs/nullable'
|
||||
no-invalid-media-type-examples:
|
||||
- '#/paths/~1/post/responses/422/content/application~1json/example'
|
||||
- '#/paths/~1/post/responses/424/content/application~1json/example'
|
||||
- '#/paths/~1/post/responses/429/content/application~1json/example'
|
||||
- '#/paths/~1/post/responses/500/content/application~1json/example'
|
||||
- '#/paths/~1generate/post/responses/422/content/application~1json/example'
|
||||
- '#/paths/~1generate/post/responses/424/content/application~1json/example'
|
||||
- '#/paths/~1generate/post/responses/429/content/application~1json/example'
|
||||
- '#/paths/~1generate/post/responses/500/content/application~1json/example'
|
||||
- >-
|
||||
#/paths/~1generate_stream/post/responses/422/content/text~1event-stream/example
|
||||
- >-
|
||||
#/paths/~1generate_stream/post/responses/424/content/text~1event-stream/example
|
||||
- >-
|
||||
#/paths/~1generate_stream/post/responses/429/content/text~1event-stream/example
|
||||
- >-
|
||||
#/paths/~1generate_stream/post/responses/500/content/text~1event-stream/example
|
||||
- '#/paths/~1tokenize/post/responses/404/content/application~1json/example'
|
||||
- >-
|
||||
#/paths/~1v1~1chat~1completions/post/responses/422/content/application~1json/example
|
||||
- >-
|
||||
#/paths/~1v1~1chat~1completions/post/responses/424/content/application~1json/example
|
||||
- >-
|
||||
#/paths/~1v1~1chat~1completions/post/responses/429/content/application~1json/example
|
||||
- >-
|
||||
#/paths/~1v1~1chat~1completions/post/responses/500/content/application~1json/example
|
||||
- >-
|
||||
#/paths/~1v1~1completions/post/responses/422/content/application~1json/example
|
||||
- >-
|
||||
#/paths/~1v1~1completions/post/responses/424/content/application~1json/example
|
||||
- >-
|
||||
#/paths/~1v1~1completions/post/responses/429/content/application~1json/example
|
||||
- >-
|
||||
#/paths/~1v1~1completions/post/responses/500/content/application~1json/example
|
||||
operation-4xx-response:
|
||||
- '#/paths/~1health/get/responses'
|
||||
- '#/paths/~1info/get/responses'
|
||||
- '#/paths/~1metrics/get/responses'
|
||||
no-unused-components:
|
||||
- '#/components/schemas/Completion'
|
||||
security-defined:
|
||||
- '#/paths/~1/post'
|
||||
- '#/paths/~1generate/post'
|
||||
- '#/paths/~1generate_stream/post'
|
||||
- '#/paths/~1health/get'
|
||||
- '#/paths/~1info/get'
|
||||
- '#/paths/~1metrics/get'
|
||||
- '#/paths/~1tokenize/post'
|
||||
- '#/paths/~1v1~1chat~1completions/post'
|
||||
- '#/paths/~1v1~1completions/post'
|
File diff suppressed because it is too large
21
Cargo.toml
|
@ -1,10 +1,19 @@
|
|||
[workspace]
|
||||
members = [
|
||||
"benchmark",
|
||||
"router",
|
||||
"router/client",
|
||||
"router/grpc-metadata",
|
||||
"launcher"
|
||||
"benchmark",
|
||||
"backends/v3",
|
||||
"backends/grpc-metadata",
|
||||
"backends/trtllm",
|
||||
"backends/client",
|
||||
"launcher"
|
||||
]
|
||||
default-members = [
|
||||
"benchmark",
|
||||
"backends/v3",
|
||||
"backends/grpc-metadata",
|
||||
# "backends/trtllm",
|
||||
"backends/client",
|
||||
"launcher"
|
||||
]
|
||||
resolver = "2"
|
||||
|
||||
|
@ -18,6 +27,8 @@ homepage = "https://github.com/huggingface/text-generation-inference"
|
|||
base64 = "0.22.0"
|
||||
tokenizers = { version = "0.19.1", features = ["http"] }
|
||||
hf-hub = { version = "0.3.1", features = ["tokio"] }
|
||||
metrics = { version = "0.23.0" }
|
||||
metrics-exporter-prometheus = { version = "0.15.1", features = [] }
|
||||
|
||||
[profile.release]
|
||||
incremental = true
|
||||
|
|
|
@ -11,6 +11,7 @@ COPY rust-toolchain.toml rust-toolchain.toml
|
|||
COPY proto proto
|
||||
COPY benchmark benchmark
|
||||
COPY router router
|
||||
COPY backends backends
|
||||
COPY launcher launcher
|
||||
RUN cargo chef prepare --recipe-path recipe.json
|
||||
|
||||
|
@ -33,6 +34,7 @@ COPY rust-toolchain.toml rust-toolchain.toml
|
|||
COPY proto proto
|
||||
COPY benchmark benchmark
|
||||
COPY router router
|
||||
COPY backends backends
|
||||
COPY launcher launcher
|
||||
RUN cargo build --profile release-opt
|
||||
|
||||
|
|
|
@ -0,0 +1,23 @@
|
|||
# All the tooling for CUDA
|
||||
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 AS cuda-builder
|
||||
|
||||
WORKDIR /usr/src/tgi/backends/trtllm
|
||||
RUN apt update && apt install -y cmake git git-lfs gcc g++ ninja-build libopenmpi-dev python3-dev python3-pip wget
|
||||
|
||||
COPY . /usr/src/tgi
|
||||
RUN chmod +x scripts/install_tensorrt.sh && scripts/install_tensorrt.sh
|
||||
RUN cmake -G Ninja -B build -DTRT_LIB_DIR=/usr/local/tensorrt/lib -DTRT_INCLUDE_DIR=/usr/local/tensorrt/include .
|
||||
RUN cmake --build build --parallel -t tgi_trtllm_backend_impl
|
||||
|
||||
# All the tooling for Rust
|
||||
FROM lukemathwalker/cargo-chef:latest-rust-1.79 AS chef
|
||||
WORKDIR /usr/src
|
||||
|
||||
# Include CUDA related libraries and tools to the Rust based image
|
||||
COPY --from=cuda-builder /usr/local/cuda /usr/local/cuda
|
||||
COPY --from=cuda-builder /usr/local/tensorrt /usr/local/tensorrt
|
||||
COPY --from=cuda-builder /usr/src/tgi/backends/trtllm/build /usr/local/tgi/trtllm/build
|
||||
ENV PATH=/usr/local/cuda/bin:$PATH
|
||||
ENV LD_LIBRARY_PATH=/usr/local/tensorrt/lib:$LD_LIBRARY_PATH
|
||||
|
||||
RUN apt update && apt install -y cmake git gcc g++ ninja-build libopenmpi3
|
|
@ -11,6 +11,7 @@ COPY rust-toolchain.toml rust-toolchain.toml
|
|||
COPY proto proto
|
||||
COPY benchmark benchmark
|
||||
COPY router router
|
||||
COPY backends backends
|
||||
COPY launcher launcher
|
||||
RUN cargo chef prepare --recipe-path recipe.json
|
||||
|
||||
|
@ -33,6 +34,7 @@ COPY rust-toolchain.toml rust-toolchain.toml
|
|||
COPY proto proto
|
||||
COPY benchmark benchmark
|
||||
COPY router router
|
||||
COPY backends backends
|
||||
COPY launcher launcher
|
||||
RUN cargo build --profile release-opt
|
||||
|
||||
|
|
|
@ -12,6 +12,7 @@ COPY rust-toolchain.toml rust-toolchain.toml
|
|||
COPY proto proto
|
||||
COPY benchmark benchmark
|
||||
COPY router router
|
||||
COPY backends backends
|
||||
COPY launcher launcher
|
||||
RUN cargo chef prepare --recipe-path recipe.json
|
||||
|
||||
|
@ -34,6 +35,7 @@ COPY rust-toolchain.toml rust-toolchain.toml
|
|||
COPY proto proto
|
||||
COPY benchmark benchmark
|
||||
COPY router router
|
||||
COPY backends backends
|
||||
COPY launcher launcher
|
||||
RUN cargo build --profile release-opt
|
||||
|
||||
|
|
6
Makefile
|
@ -5,13 +5,13 @@ install-server-cpu:
|
|||
cd server && make install-server
|
||||
|
||||
install-router:
|
||||
cd router && cargo install --path .
|
||||
cargo install --path backends/v3/
|
||||
|
||||
install-launcher:
|
||||
cd launcher && cargo install --path .
|
||||
cargo install --path launcher/
|
||||
|
||||
install-benchmark:
|
||||
cd benchmark && cargo install --path .
|
||||
cargo install --path benchmark/
|
||||
|
||||
install: install-server install-router install-launcher
|
||||
|
||||
|
|
|
@ -0,0 +1,63 @@
|
|||
cmake_minimum_required(VERSION 3.20)
|
||||
|
||||
project(tgi-trtllm-backend VERSION 1.0.0)
|
||||
set(CMAKE_CXX_STANDARD 20)
|
||||
|
||||
include(FetchContent)
|
||||
include(ExternalProject)
|
||||
|
||||
option(TGI_TRTLLM_BACKEND_BUILD_TESTS "Enable building the unittests suite" OFF)
|
||||
option(TGI_TRTLLM_BACKEND_BUILD_EXAMPLES "Enable building the examples suite" OFF)
|
||||
set(TGI_TRTLLM_BACKEND_TARGET_CUDA_ARCH_LIST "89-real" CACHE STRING "List of CUDA architectures to support")
|
||||
set(TGI_TRTLLM_BACKEND_TRT_ROOT "/usr/local/tensorrt" CACHE STRING "Path where TensorRT libraries and headers are located")
|
||||
set(TGI_TRTLLM_BACKEND_TRT_INCLUDE_DIR "${TGI_TRTLLM_BACKEND_TRT_ROOT}/include" CACHE STRING "Path where TensorRT headers are located")
|
||||
set(TGI_TRTLLM_BACKEND_TRT_LIB_DIR "${TGI_TRTLLM_BACKEND_TRT_ROOT}/lib" CACHE STRING "Path where TensorRT libraries are located")
|
||||
|
||||
# We use nvidia-ml to query device information at runtime and enable some architecture-specific features
|
||||
find_package(CUDAToolkit 12.5 REQUIRED COMPONENTS CUDA::cudart CUDA::nvml)
|
||||
|
||||
#### External dependencies ####
|
||||
include(cmake/fmt.cmake)
|
||||
include(cmake/json.cmake)
|
||||
include(cmake/spdlog.cmake)
|
||||
include(cmake/trtllm.cmake)
|
||||
|
||||
# Let's build TRTLLM as part of CMake
|
||||
add_subdirectory("${trtllm_SOURCE_DIR}/cpp" "${trtllm_SOURCE_DIR}/..")
|
||||
|
||||
# Tell CMake not to try to override the RPATH for executorWorker as it has no information on how to do so
|
||||
set_target_properties(executorWorker PROPERTIES SKIP_BUILD_RPATH TRUE)
|
||||
|
||||
# TGI TRTLLM Backend definition
|
||||
add_library(tgi_trtllm_backend_impl STATIC include/backend.h lib/backend.cpp include/hardware.h)
|
||||
include_directories(${TGI_TRTLLM_BACKEND_TRT_INCLUDE_DIR})
|
||||
target_include_directories(tgi_trtllm_backend_impl PRIVATE
|
||||
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
|
||||
$<INSTALL_INTERFACE:include>
|
||||
)
|
||||
target_include_directories(tgi_trtllm_backend_impl PUBLIC "${trtllm_SOURCE_DIR}/cpp/include")
|
||||
target_link_libraries(tgi_trtllm_backend_impl PRIVATE tensorrt_llm nvinfer_plugin_tensorrt_llm tensorrt_llm_nvrtc_wrapper CUDA::cudart CUDA::nvml)
|
||||
target_link_libraries(tgi_trtllm_backend_impl PUBLIC nlohmann_json::nlohmann_json spdlog::spdlog fmt::fmt)
|
||||
|
||||
# This installs all the artifacts in CMAKE_INSTALL_PREFIX under include/ lib/ bin/ to make them easy to find and link against
|
||||
install(TARGETS tgi_trtllm_backend_impl tensorrt_llm nvinfer_plugin_tensorrt_llm decoder_attention executorWorker)
|
||||
install(FILES ${TRTLLM_NVRTC_WRAPPER_LIBRARY_PATH} ${TRTLLM_EXECUTOR_STATIC_LIBRARY_PATH} TYPE LIB)
|
||||
|
||||
#### Unit Tests ####
|
||||
if (${TGI_TRTLLM_BACKEND_BUILD_TESTS})
|
||||
message(STATUS "Building tests")
|
||||
FetchContent_Declare(
|
||||
Catch2
|
||||
GIT_REPOSITORY https://github.com/catchorg/Catch2
|
||||
GIT_TAG v3.6.0
|
||||
)
|
||||
FetchContent_MakeAvailable(Catch2)
|
||||
|
||||
# add_executable(tgi_trtllm_backend_tests tests/infer_test.cpp)
|
||||
# target_link_libraries(tgi_trtllm_backend_tests PRIVATE tgi_trtllm_backend_impl Catch2::Catch2WithMain nlohmann_json::nlohmann_json spdlog::spdlog fmt::fmt CUDA::cudart CUDA::nvml)
|
||||
|
||||
list(APPEND CMAKE_MODULE_PATH ${catch2_SOURCE_DIR}/extras)
|
||||
include(CTest)
|
||||
include(Catch)
|
||||
# catch_discover_tests(tgi_trtllm_backend_tests)
|
||||
endif ()
|
|
@ -0,0 +1,26 @@
|
|||
[package]
|
||||
name = "text-generation-backends-trtllm"
|
||||
version.workspace = true
|
||||
edition.workspace = true
|
||||
authors.workspace = true
|
||||
homepage.workspace = true
|
||||
|
||||
[dependencies]
|
||||
async-trait = "0.1"
|
||||
async-stream = "0.3"
|
||||
cxx = "1.0"
|
||||
text-generation-router = { path = "../../router" }
|
||||
tokenizers = { version = "0.19", features = ["hf-hub"] }
|
||||
tokio = { version = "1.38", features = ["rt", "rt-multi-thread", "parking_lot", "signal", "sync"] }
|
||||
tokio-stream = "0.1.15"
|
||||
clap = { version = "4.5", features = ["derive"] }
|
||||
thiserror = "1.0.62"
|
||||
tracing = "0.1"
|
||||
tracing-opentelemetry = "0.24"
|
||||
tracing-subscriber = { version = "0.3", features = ["json", "env-filter"] }
|
||||
log = { version = "0.4", features = [] }
|
||||
|
||||
[build-dependencies]
|
||||
cmake = "0.1"
|
||||
cxx-build = { version = "1.0", features = ["parallel"] }
|
||||
pkg-config = "0.3"
|
|
@ -0,0 +1,100 @@
|
|||
ARG CUDA_ARCH_LIST="75-real;80-real;86-real;89-real;90-real"
|
||||
ARG OMPI_VERSION="4.1.6"
|
||||
|
||||
# Build dependencies resolver stage
|
||||
FROM lukemathwalker/cargo-chef:latest AS chef
|
||||
WORKDIR /usr/src/text-generation-inference
|
||||
|
||||
FROM chef AS planner
|
||||
COPY . .
|
||||
RUN cargo chef prepare --recipe-path recipe.json
|
||||
|
||||
# CUDA dependent dependencies resolver stage
|
||||
FROM nvidia/cuda:12.5.1-cudnn-devel-ubuntu22.04 AS cuda-builder
|
||||
|
||||
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
|
||||
--mount=type=cache,target=/var/lib/apt,sharing=locked \
|
||||
apt update && apt install -y \
|
||||
build-essential \
|
||||
cmake \
|
||||
curl \
|
||||
gcc \
|
||||
g++ \
|
||||
git \
|
||||
git-lfs \
|
||||
libssl-dev \
|
||||
ninja-build \
|
||||
pkg-config \
|
||||
python3 \
|
||||
python3-setuptools \
|
||||
tar \
|
||||
wget
|
||||
|
||||
ENV TGI_INSTALL_PREFIX=/usr/local/tgi
|
||||
ENV TENSORRT_INSTALL_PREFIX=/usr/local/tensorrt
|
||||
|
||||
# Install OpenMPI
|
||||
FROM cuda-builder AS mpi-builder
|
||||
ARG OMPI_VERSION
|
||||
|
||||
ENV OMPI_TARBALL_FILENAME="openmpi-$OMPI_VERSION.tar.bz2"
|
||||
RUN wget "https://download.open-mpi.org/release/open-mpi/v4.1/$OMPI_TARBALL_FILENAME" -P /opt/src && \
|
||||
mkdir /usr/src/mpi && \
|
||||
tar -xf "/opt/src/$OMPI_TARBALL_FILENAME" -C /usr/src/mpi --strip-components=1 && \
|
||||
cd /usr/src/mpi && \
|
||||
./configure --prefix=/usr/local/mpi --with-cuda=/usr/local/cuda --without-slurm && \
|
||||
make -j all && \
|
||||
make install && \
|
||||
rm -rf "/opt/src/$OMPI_TARBALL_FILENAME"
|
||||
|
||||
# Install TensorRT
|
||||
FROM cuda-builder AS trt-builder
|
||||
COPY backends/trtllm/scripts/install_tensorrt.sh /opt/install_tensorrt.sh
|
||||
RUN chmod +x /opt/install_tensorrt.sh && \
|
||||
/opt/install_tensorrt.sh
|
||||
|
||||
# Build Backend
|
||||
FROM cuda-builder AS tgi-builder
|
||||
WORKDIR /usr/src/text-generation-inference
|
||||
|
||||
# Install Rust
|
||||
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | bash -s -- -y && \
|
||||
chmod -R a+w /root/.rustup && \
|
||||
chmod -R a+w /root/.cargo
|
||||
|
||||
ENV PATH="/root/.cargo/bin:$PATH"
|
||||
RUN cargo install cargo-chef
|
||||
|
||||
# Cache dependencies
|
||||
COPY --from=planner /usr/src/text-generation-inference/recipe.json .
|
||||
RUN cargo chef cook --release --recipe-path recipe.json
|
||||
|
||||
# Build actual TGI
|
||||
ARG CUDA_ARCH_LIST
|
||||
ENV CMAKE_PREFIX_PATH="/usr/local/mpi:/usr/local/tensorrt:$CMAKE_PREFIX_PATH"
|
||||
ENV LD_LIBRARY_PATH="/usr/local/mpi/lib:$LD_LIBRARY_PATH"
|
||||
ENV PKG_CONFIG_PATH="/usr/local/mpi/lib/pkgconfig:$PKG_CONFIG_PATH"
|
||||
|
||||
COPY . .
|
||||
COPY --from=trt-builder /usr/local/tensorrt /usr/local/tensorrt
|
||||
COPY --from=mpi-builder /usr/local/mpi /usr/local/mpi
|
||||
RUN mkdir $TGI_INSTALL_PREFIX && mkdir "$TGI_INSTALL_PREFIX/include" && mkdir "$TGI_INSTALL_PREFIX/lib" && \
|
||||
CMAKE_INSTALL_PREFIX=$TGI_INSTALL_PREFIX cargo build --release --bin text-generation-backends-trtllm
|
||||
|
||||
FROM nvidia/cuda:12.5.1-cudnn-runtime-ubuntu22.04 AS runtime
|
||||
WORKDIR /usr/local/tgi/bin
|
||||
|
||||
ENV LD_LIBRARY_PATH="/usr/local/tgi/lib:/usr/local/tensorrt/lib:/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH"
|
||||
|
||||
COPY --from=mpi-builder /usr/local/mpi /usr/local/mpi
|
||||
COPY --from=trt-builder /usr/local/tensorrt /usr/local/tensorrt
|
||||
COPY --from=tgi-builder /usr/local/tgi /usr/local/tgi
|
||||
COPY --from=tgi-builder /usr/src/text-generation-inference/target/release/text-generation-backends-trtllm /usr/local/tgi/bin/text-generation-launcher
|
||||
|
||||
FROM runtime
|
||||
|
||||
LABEL co.huggingface.vendor="Hugging Face Inc."
|
||||
LABEL org.opencontainers.image.authors="hardware@hf.co"
|
||||
|
||||
ENTRYPOINT ["./text-generation-launcher"]
|
||||
CMD ["--executor-worker", "/usr/local/tgi/bin/executorWorker"]
|
|
@ -0,0 +1,46 @@
|
|||
# Text Generation Inference - TensorRT-LLM Backend Implementation
|
||||
|
||||
## Description
|
||||
|
||||
This folder provides the sources of the TensorRT-LLM backend implementation, powered by the new TensorRT-LLM Executor API.
|
||||
|
||||
## Simplified Request Sequence
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
actor User
|
||||
participant TextGenerationInference.HttpServer
|
||||
participant TextGenerationInference.TensorRtLlmBackend
|
||||
participant TextGenerationInference.TensorRtLlmWorkerThread
|
||||
participant TensorRtLlm.Executor
|
||||
participant Nvidia.Gpu
|
||||
User ->> TextGenerationInference.HttpServer: POST /generate
|
||||
TextGenerationInference.HttpServer ->> TextGenerationInference.TensorRtLlmBackend: Validate and forward inputs & parameters
|
||||
TextGenerationInference.TensorRtLlmBackend ->> TextGenerationInference.TensorRtLlmWorkerThread: Allocate a new context and spawn a new thread to handle the request
|
||||
TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Submit the request to the In-Flight Batcher
|
||||
activate Nvidia.Gpu
|
||||
TensorRtLlm.Executor ->> Nvidia.Gpu: Add the request to the pool for execution
|
||||
TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Respond with a unique request identifier
|
||||
rect rgb(10, 92, 54)
|
||||
loop every 100us
|
||||
rect rgb(15, 81, 50)
|
||||
alt Acquire lock to query executor
|
||||
TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Poll request number of new token(s) generated
|
||||
else There are new generated tokens
|
||||
TextGenerationInference.TensorRtLlmWorkerThread ->> TensorRtLlm.Executor: Retrieve newly generated tokens
|
||||
TensorRtLlm.Executor -->> TextGenerationInference.TensorRtLlmWorkerThread: Return decoded token information and potential error (omitted)
|
||||
rect rgb(11, 110, 79)
|
||||
alt Generated token is final
|
||||
TensorRtLlm.Executor ->> Nvidia.Gpu: Remove request from the scheduler and from the GPU
|
||||
TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream the remaining decoded tokens and flush the connection
|
||||
else Generated token is not final
|
||||
TextGenerationInference.TensorRtLlmWorkerThread -->> User: Stream tokens back to the user as they get decoded
|
||||
end
|
||||
end
|
||||
end
|
||||
end
|
||||
deactivate Nvidia.Gpu
|
||||
end
|
||||
end
|
||||
|
||||
```
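
The polling loop in the diagram above can be sketched in Rust roughly as follows. This is a minimal illustration under stated assumptions, not the backend's actual code: `MockExecutor`, `DecodedToken`, `stream_request` and `run_demo` are hypothetical names standing in for the real TensorRT-LLM Executor bindings, and the sleep interval simply mirrors the 100us figure from the diagram.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical result of one decoding step (not the backend's real struct).
#[derive(Debug, Clone, PartialEq)]
struct DecodedToken {
    id: u32,
    is_final: bool,
}

/// Hypothetical stand-in for the executor: it hands out pre-baked tokens.
struct MockExecutor {
    pending: VecDeque<DecodedToken>,
}

impl MockExecutor {
    /// Number of tokens currently ready for the request.
    fn num_ready(&self) -> usize {
        self.pending.len()
    }

    /// Drain every token that is currently ready.
    fn pull_tokens(&mut self) -> Vec<DecodedToken> {
        self.pending.drain(..).collect()
    }
}

/// Poll the executor until a final token is seen, forwarding tokens to `tx`.
fn stream_request(executor: Arc<Mutex<MockExecutor>>, tx: Sender<DecodedToken>) {
    loop {
        // Hold the lock only while querying the executor.
        let tokens = {
            let mut guard = executor.lock().expect("executor lock poisoned");
            if guard.num_ready() > 0 {
                guard.pull_tokens()
            } else {
                Vec::new()
            }
        };

        for token in tokens {
            let is_final = token.is_final;
            // Stop when the receiver is gone or the final token was sent.
            if tx.send(token).is_err() || is_final {
                return;
            }
        }

        thread::sleep(Duration::from_micros(100));
    }
}

/// Run one mocked request end to end and collect the streamed token ids.
fn run_demo() -> Vec<u32> {
    let executor = Arc::new(Mutex::new(MockExecutor {
        pending: VecDeque::from(vec![
            DecodedToken { id: 1, is_final: false },
            DecodedToken { id: 2, is_final: false },
            DecodedToken { id: 3, is_final: true },
        ]),
    }));
    let (tx, rx): (Sender<DecodedToken>, Receiver<DecodedToken>) = channel();
    let worker = thread::spawn(move || stream_request(executor, tx));
    // The iterator ends once the worker returns and drops its sender.
    let ids: Vec<u32> = rx.iter().map(|t| t.id).collect();
    worker.join().expect("worker thread panicked");
    ids
}

fn main() {
    println!("{:?}", run_demo());
}
```

Scoping the lock to the query keeps other requests free to reach the executor while tokens are being forwarded to the client.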
|
|
@ -0,0 +1,150 @@
|
|||
use cxx_build::CFG;
|
||||
use pkg_config;
|
||||
use std::env;
|
||||
use std::env::consts::ARCH;
|
||||
use std::path::{absolute, PathBuf};
|
||||
|
||||
const ADDITIONAL_BACKEND_LINK_LIBRARIES: [&str; 2] = ["spdlog", "fmt"];
|
||||
const CUDA_ARCH_LIST: Option<&str> = option_env!("CUDA_ARCH_LIST");
|
||||
const CUDA_REQUIRED_VERSION: &str = "12.5";
|
||||
const MPI_REQUIRED_VERSION: &str = "4.1";
|
||||
const INSTALL_PREFIX: Option<&str> = option_env!("CMAKE_INSTALL_PREFIX");
|
||||
const TENSORRT_ROOT_DIR: Option<&str> = option_env!("TENSORRT_ROOT_DIR");
|
||||
const NCCL_ROOT_DIR: Option<&str> = option_env!("NCCL_ROOT_DIR");
|
||||
|
||||
// Dependencies
|
||||
const BACKEND_DEPS: [&str; 2] = ["tgi_trtllm_backend_impl", "tgi_trtllm_backend"];
|
||||
const CUDA_TRANSITIVE_DEPS: [&str; 4] = ["cuda", "cudart", "cublas", "nvidia-ml"];
|
||||
const TENSORRT_LLM_TRANSITIVE_DEPS: [(&str, &str); 5] = [
|
||||
("dylib", "tensorrt_llm"),
|
||||
("static", "tensorrt_llm_executor_static"),
|
||||
("dylib", "tensorrt_llm_nvrtc_wrapper"),
|
||||
("dylib", "nvinfer_plugin_tensorrt_llm"),
|
||||
("dylib", "decoder_attention"),
|
||||
];
|
||||
|
||||
macro_rules! probe {
|
||||
($name: expr, $version: expr) => {
|
||||
if let Err(_) = pkg_config::probe_library($name) {
|
||||
pkg_config::probe_library(&format!("{}-{}", $name, $version))
|
||||
.expect(&format!("Failed to locate {}", $name));
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
fn build_backend(is_debug: bool, opt_level: &str, out_dir: &PathBuf) -> (PathBuf, PathBuf) {
|
||||
// Build the backend implementation through CMake
|
||||
let install_path = INSTALL_PREFIX.unwrap_or("/usr/local/tgi");
|
||||
let tensorrt_path = TENSORRT_ROOT_DIR.unwrap_or("/usr/local/tensorrt");
|
||||
let cuda_arch_list = CUDA_ARCH_LIST.unwrap_or("90-real"); // Hopper by default
|
||||
|
||||
let mut install_path = PathBuf::from(install_path);
|
||||
if !install_path.is_absolute() {
|
||||
install_path = absolute(out_dir).expect("cannot happen").join(install_path);
|
||||
}
|
||||
|
||||
let _ = cmake::Config::new(".")
|
||||
.uses_cxx11()
|
||||
.generator("Ninja")
|
||||
.profile(match is_debug {
|
||||
true => "Debug",
|
||||
false => "Release",
|
||||
})
|
||||
.env("OPT_LEVEL", opt_level)
|
||||
.define("CMAKE_INSTALL_PREFIX", &install_path)
|
||||
.define("CMAKE_CUDA_COMPILER", "/usr/local/cuda/bin/nvcc")
|
||||
.define("TGI_TRTLLM_BACKEND_TARGET_CUDA_ARCH_LIST", cuda_arch_list)
|
||||
.define("TGI_TRTLLM_BACKEND_TRT_ROOT", tensorrt_path)
|
||||
.build();
|
||||
|
||||
// Additional transitive CMake dependencies
|
||||
let deps_folder = out_dir.join("build").join("_deps");
|
||||
for dependency in ADDITIONAL_BACKEND_LINK_LIBRARIES {
|
||||
let dep_name = match is_debug {
|
||||
true => format!("{}d", dependency),
|
||||
false => String::from(dependency),
|
||||
};
|
||||
let dep_path = deps_folder.join(format!("{}-build", dependency));
|
||||
println!("cargo:rustc-link-search={}", dep_path.display());
|
||||
println!("cargo:rustc-link-lib=static={}", dep_name);
|
||||
}
|
||||
|
||||
// Emit linkage information from the artifacts we just built
|
||||
let install_lib_path = install_path.join("lib");
|
||||
|
||||
println!(
|
||||
r"cargo:warning=Adding link search path: {}",
|
||||
install_lib_path.display()
|
||||
);
|
||||
println!(r"cargo:rustc-link-search={}", install_lib_path.display());
|
||||
|
||||
(PathBuf::from(install_path), deps_folder)
|
||||
}
|
||||
|
||||
fn build_ffi_layer(deps_folder: &PathBuf) {
|
||||
CFG.include_prefix = "backends/trtllm";
|
||||
cxx_build::bridge("src/lib.rs")
|
||||
.static_flag(true)
|
||||
.include(deps_folder.join("fmt-src").join("include"))
|
||||
.include(deps_folder.join("spdlog-src").join("include"))
|
||||
.include(deps_folder.join("json-src").join("include"))
|
||||
.include(deps_folder.join("trtllm-src").join("cpp").join("include"))
|
||||
.include("/usr/local/cuda/include")
|
||||
.include("/usr/local/tensorrt/include")
|
||||
.file("src/ffi.cpp")
|
||||
.std("c++20")
|
||||
.compile("tgi_trtllm_backend");
|
||||
|
||||
println!("cargo:rerun-if-changed=CMakeLists.txt");
|
||||
println!("cargo:rerun-if-changed=include/backend.h");
|
||||
println!("cargo:rerun-if-changed=lib/backend.cpp");
|
||||
println!("cargo:rerun-if-changed=include/ffi.h");
|
||||
println!("cargo:rerun-if-changed=src/ffi.cpp");
|
||||
}
|
||||
|
||||
fn main() {
|
||||
// Misc variables
|
||||
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
|
||||
let build_profile = env::var("PROFILE").unwrap();
|
||||
let (is_debug, opt_level) = match build_profile.as_ref() {
|
||||
"debug" => (true, "0"),
|
||||
_ => (false, "3"),
|
||||
};
|
||||
|
||||
// Build the backend
|
||||
let (_backend_path, deps_folder) = build_backend(is_debug, opt_level, &out_dir);
|
||||
|
||||
// Build the FFI layer calling the backend above
|
||||
build_ffi_layer(&deps_folder);
|
||||
|
||||
// Emit linkage search path
|
||||
probe!("ompi", MPI_REQUIRED_VERSION);
|
||||
|
||||
// Probe CUDA & co. with pkg-config
|
||||
CUDA_TRANSITIVE_DEPS.iter().for_each(|name| {
|
||||
probe!(name, CUDA_REQUIRED_VERSION);
|
||||
});
|
||||
|
||||
// NCCL is slightly trickier because it might not have a pkgconfig installed
|
||||
let nccl_library_path_default = format!("/usr/local/{}-linux-gnu", ARCH);
|
||||
let nccl_library_path = NCCL_ROOT_DIR.unwrap_or(&nccl_library_path_default);
|
||||
println!(r"cargo:rustc-link-search=native={}", nccl_library_path);
|
||||
println!("cargo:rustc-link-lib=dylib=nccl");
|
||||
|
||||
// TensorRT
|
||||
let tensort_library_path = TENSORRT_ROOT_DIR.unwrap_or("/usr/local/tensorrt/lib");
|
||||
println!(r"cargo:rustc-link-search=native={}", tensort_library_path);
|
||||
println!("cargo:rustc-link-lib=dylib=nvinfer");
|
||||
|
||||
// TensorRT-LLM
|
||||
TENSORRT_LLM_TRANSITIVE_DEPS
|
||||
.iter()
|
||||
.for_each(|(link_type, name)| {
|
||||
println!("cargo:rustc-link-lib={}={}", link_type, name);
|
||||
});
|
||||
|
||||
// Backend
|
||||
BACKEND_DEPS.iter().for_each(|name| {
|
||||
println!("cargo:rustc-link-lib=static={}", name);
|
||||
});
|
||||
}
|
|
@ -0,0 +1,6 @@
|
|||
FetchContent_Declare(
|
||||
fmt
|
||||
GIT_REPOSITORY https://github.com/fmtlib/fmt
|
||||
GIT_TAG 11.0.1
|
||||
)
|
||||
FetchContent_MakeAvailable(fmt)
|
|
@ -0,0 +1,5 @@
|
|||
fetchcontent_declare(
|
||||
json
|
||||
URL https://github.com/nlohmann/json/releases/download/v3.11.3/json.tar.xz
|
||||
)
|
||||
fetchcontent_makeavailable(json)
|
|
@@ -0,0 +1,17 @@
set(SPDLOG_USE_FMT ON)
set(SPDLOG_BUILD_SHARED OFF)
set(SPDLOG_FMT_EXTERNAL ON)

# Set the compile-time log level (SPDLOG_ACTIVE_LEVEL) according to the build type
if (${CMAKE_BUILD_TYPE} STREQUAL "Debug")
    add_compile_definitions(SPDLOG_ACTIVE_LEVEL=SPDLOG_LEVEL_DEBUG)
else ()
    add_compile_definitions(SPDLOG_ACTIVE_LEVEL=SPDLOG_LEVEL_INFO)
endif ()

fetchcontent_declare(
        spdlog
        GIT_REPOSITORY https://github.com/gabime/spdlog.git
        GIT_TAG v1.14.1
)
fetchcontent_makeavailable(spdlog)
|
|
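`SPDLOG_ACTIVE_LEVEL` is a compile-time filter: logging macros below the active level expand to nothing, so their arguments are never evaluated. Note that a bare `-DNAME` defines the macro to `1`, so the level has to be passed as `NAME=VALUE`. The sketch below illustrates the mechanism without depending on spdlog; every `SKETCH_*` macro, `Sink` and `EmitBoth` are hypothetical names, and only the numeric ordering (debug = 1, info = 2) mirrors spdlog:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Numeric levels mirror spdlog's ordering (debug = 1, info = 2).
#define SKETCH_LEVEL_DEBUG 1
#define SKETCH_LEVEL_INFO 2

// Normally provided by the build system, e.g. -DSKETCH_ACTIVE_LEVEL=SKETCH_LEVEL_DEBUG.
#ifndef SKETCH_ACTIVE_LEVEL
#define SKETCH_ACTIVE_LEVEL SKETCH_LEVEL_INFO
#endif

// In-memory sink so the effect of the filter can be observed.
inline std::vector<std::string> &Sink() {
    static std::vector<std::string> lines;
    return lines;
}

// Macros below the active level compile to a no-op.
#if SKETCH_ACTIVE_LEVEL <= SKETCH_LEVEL_DEBUG
#define SKETCH_DEBUG(msg) Sink().emplace_back(msg)
#else
#define SKETCH_DEBUG(msg) (void) 0
#endif

#if SKETCH_ACTIVE_LEVEL <= SKETCH_LEVEL_INFO
#define SKETCH_INFO(msg) Sink().emplace_back(msg)
#else
#define SKETCH_INFO(msg) (void) 0
#endif

// Emit one message per level and report how many survived the filter.
inline std::size_t EmitBoth() {
    Sink().clear();
    SKETCH_DEBUG("debug line");
    SKETCH_INFO("info line");
    return Sink().size();
}
```

Built without any definition, the sketch defaults to the info level and only the info message is recorded.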
@@ -0,0 +1,42 @@
set(TRT_INCLUDE_DIR ${TGI_TRTLLM_BACKEND_TRT_INCLUDE_DIR})
set(TRT_LIB_DIR ${TGI_TRTLLM_BACKEND_TRT_LIB_DIR})

set(USE_CXX11_ABI ON)
set(BUILD_PYT OFF)
set(BUILD_PYBIND OFF)
set(BUILD_MICRO_BENCHMARKS OFF)
set(BUILD_BENCHMARKS OFF)
set(BUILD_TESTS OFF)
set(CMAKE_CUDA_ARCHITECTURES ${TGI_TRTLLM_BACKEND_TARGET_CUDA_ARCH_LIST})

message(STATUS "Building for CUDA Architectures: ${CMAKE_CUDA_ARCHITECTURES}")

if (${CMAKE_BUILD_TYPE} STREQUAL "Debug")
    set(FAST_BUILD ON)
    set(NVTX_DISABLE OFF)
else ()
    set(FAST_BUILD OFF)
    set(FAST_MATH ON)
    set(NVTX_DISABLE ON)
endif ()

fetchcontent_declare(
        trtllm
        GIT_REPOSITORY https://github.com/NVIDIA/TensorRT-LLM.git
        GIT_TAG a681853d3803ee5893307e812530b5e7004bb6e1
        GIT_SHALLOW FALSE
)
fetchcontent_makeavailable(trtllm)

message(STATUS "Found TensorRT-LLM: ${trtllm_SOURCE_DIR}")
execute_process(COMMAND git lfs install WORKING_DIRECTORY "${trtllm_SOURCE_DIR}/")
execute_process(COMMAND git lfs pull WORKING_DIRECTORY "${trtllm_SOURCE_DIR}/")

# TRTLLM uses a *precompiled* JIT library to generate some specific kernels; compute the path to that library here
set(TRTLLM_NVRTC_LIBRARY_NAME "${CMAKE_SHARED_LIBRARY_PREFIX}tensorrt_llm_nvrtc_wrapper${CMAKE_SHARED_LIBRARY_SUFFIX}" CACHE INTERNAL "nvrtc wrapper library name")
set(TRTLLM_NVRTC_WRAPPER_LIBRARY_PATH "${trtllm_SOURCE_DIR}/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/${CMAKE_LIBRARY_ARCHITECTURE}/${TRTLLM_NVRTC_LIBRARY_NAME}"
        CACHE INTERNAL "nvrtc wrapper library path")

# Same for the executor static library
set(TRTLLM_EXECUTOR_STATIC_LIBRARY_NAME "${CMAKE_STATIC_LIBRARY_PREFIX}tensorrt_llm_executor_static${CMAKE_STATIC_LIBRARY_SUFFIX}" CACHE INTERNAL "executor_static library name")
set(TRTLLM_EXECUTOR_STATIC_LIBRARY_PATH "${trtllm_SOURCE_DIR}/cpp/tensorrt_llm/executor/${CMAKE_LIBRARY_ARCHITECTURE}/${TRTLLM_EXECUTOR_STATIC_LIBRARY_NAME}" CACHE INTERNAL "executor_static library path")
|
|
@@ -0,0 +1,121 @@
//
// Created by Morgan Funtowicz on 6/30/24.
//

#ifndef TGI_TRTLLM_BACKEND_H
#define TGI_TRTLLM_BACKEND_H

#include <cmath>
#include <filesystem>
#include <span>
#include <vector>

#include <nlohmann/json.hpp>

#include <tensorrt_llm/runtime/common.h>
#include <tensorrt_llm/executor/executor.h>
#include <tensorrt_llm/plugins/api/tllmPlugin.h>

using json = nlohmann::json;
namespace tle = tensorrt_llm::executor;

namespace huggingface::tgi::backends {
    using RequestId = tle::IdType;
    using TokenId = tle::TokenIdType;

    /**
     * Initialize all the components required by TRTLLM.
     * This function must be called before attempting to load any engine
     */
    void InitializeBackend();

    /**
     * Build the executor configuration for the engine described by config
     * @param config TensorRT-LLM configuration object
     * @param workerPath Path to the "executorWorker" provided by TensorRT-LLM when using orchestrator mode
     * @return
     */
    tle::ExecutorConfig GetExecutorConfig(const json &config, const std::string &workerPath);

    /**
     * Get the sampling configuration from the parameters provided by TGI
     * @param topK
     * @param topP
     * @param temperature
     * @param repetition_penalty
     * @param frequency_penalty
     * @param seed
     * @return
     */
    tle::SamplingConfig GetSamplingConfig(
            uint32_t topK,
            float_t topP,
            float_t temperature,
            float_t repetition_penalty,
            float_t frequency_penalty,
            uint64_t seed
    );

    /**
     *
     */
    class TensorRtLlmBackend {
    private:
        const json config;
        tle::Executor executor;

    public:
        explicit TensorRtLlmBackend(
                const std::filesystem::path &engineFolder,
                const std::filesystem::path &executorWorker
        );

        /**
         * Indicate if the backend is ready to accept incoming requests
         * @return true if ready, false otherwise
         */
        [[nodiscard]] bool IsReady() const;

        /**
         * Query the executor for the number of responses available for pulling
         * @return
         */
        [[nodiscard]] size_t NumResponsesReady() const;

        /**
         * Submit a new generation task to the executor
         * @param tokens
         * @param topK
         * @param topP
         * @param temperature
         * @param repetition_penalty
         * @param frequency_penalty
         * @param seed
         * @return Request id related to this generation for reference
         */
        [[nodiscard]] RequestId Submit(
                const std::vector<TokenId> &tokens,
                int32_t topK,
                float_t topP,
                float_t temperature,
                float_t repetition_penalty,
                float_t frequency_penalty,
                uint64_t seed
        );

        /**
         * Wait for the responses available for the given request
         * @param requestId The request id to poll the generation results
         * @return
         */
        std::vector<tle::Response> Poll(RequestId requestId);

        /**
         * Stop the underlying executor
         */
        void Shutdown();
    };
}


#endif //TGI_TRTLLM_BACKEND_H
|
|
@@ -0,0 +1,75 @@
//
// Created by mfuntowicz on 7/11/24.
//

#ifndef TGI_TRTLLM_BACKEND_FFI_H
#define TGI_TRTLLM_BACKEND_FFI_H

#include <cstddef>
#include "backend.h"

namespace huggingface::tgi::backends {
    class TensorRtLlmBackendImpl;
}

#include "backends/trtllm/src/lib.rs.h"


namespace huggingface::tgi::backends {

//    struct GenerationContext;

    class TensorRtLlmBackendImpl : public TensorRtLlmBackend {
    public:
        /***
         *
         * @param engineFolder
         * @param executorWorker
         */
        TensorRtLlmBackendImpl(const std::string_view &engineFolder, const std::string_view &executorWorker);

        /***
         *
         * @return
         */
        bool IsReady() const;

        /***
         *
         * @param tokens
         * @param topK
         * @param topP
         * @param temperature
         * @param repetition_penalty
         * @param frequency_penalty
         * @param seed
         * @return
         */
        [[nodiscard("returned request id should be used to refer to the request's generation result later on")]]
        uint64_t
        Submit(rust::Slice<const uint32_t> tokens, int32_t topK, float_t topP, float_t temperature,
               float_t repetition_penalty, float_t frequency_penalty, uint64_t seed);

        /***
         *
         * @param requestId
         * @param ctx
         * @param callback
         * @return
         */
        size_t StreamTokens(
                const RequestId requestId,
                huggingface::tgi::backends::GenerationContext *ctx,
                rust::Fn<void(huggingface::tgi::backends::GenerationContext *,
                              huggingface::tgi::backends::GenerationStep)> callback);
    };

    /***
     *
     * @param engineFolder
     * @return
     */
    std::unique_ptr<TensorRtLlmBackendImpl> CreateTensorRtLlmBackend(rust::Str engineFolder, rust::Str executorWorker);
}

#endif //TGI_TRTLLM_BACKEND_FFI_H
|
|
@@ -0,0 +1,59 @@
//
// Created by mfuntowicz on 7/23/24.
//

#ifndef TGI_TRTLLM_BACKEND_HARDWARE_H
#define TGI_TRTLLM_BACKEND_HARDWARE_H

#include <cstdint>
#include <limits>
#include <fmt/base.h>
#include <spdlog/spdlog.h>
#include <nvml.h>

namespace huggingface::hardware::cuda {

#define AMPERE_SM_MAJOR 8
#define HOPPER_SM_MAJOR 9

    /**
     * Store information about the version of the CUDA Compute Capabilities detected on the device
     */
    struct CudaComputeCapabilities {
        int32_t major;
        int32_t minor;

        [[nodiscard]] constexpr bool isPostAmpere() const { return major >= AMPERE_SM_MAJOR; }

        [[nodiscard]] constexpr bool isPostHopper() const { return major >= HOPPER_SM_MAJOR; }
    };

    inline CudaComputeCapabilities GetCudaComputeCapabilities() {
        // Get the compute capabilities of the current hardware
        nvmlDevice_t device;
        CudaComputeCapabilities capabilities{0, 0};
        if (nvmlDeviceGetHandleByIndex_v2(0, &device) == NVML_SUCCESS) {
            SPDLOG_DEBUG("Successfully acquired nvmlDevice_t = 0");
            if (nvmlDeviceGetCudaComputeCapability(device, &capabilities.major, &capabilities.minor) == NVML_SUCCESS) {
                SPDLOG_INFO("Detected sm_{:d}{:d} compute capabilities", capabilities.major, capabilities.minor);
            }
        }

        return capabilities;
    }

    /**
     * Return the number of GPUs detected. If the device count cannot be queried, return std::nullopt
     * @return
     */
    inline std::optional<size_t> GetNumDevices() {
        uint32_t numGpus = 0;
        if (nvmlDeviceGetCount_v2(&numGpus) == NVML_SUCCESS) {
            return std::optional(numGpus);
        } else {
            return std::nullopt;
        }
    }
}

#endif //TGI_TRTLLM_BACKEND_HARDWARE_H
|
|
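The architecture predicates above only compare the SM major version, so they can be checked at compile time without touching NVML. A minimal sketch, assuming Ampere corresponds to major version 8 (sm_80, sm_86) and Hopper to major version 9 (sm_90); `ComputeCapabilitySketch` and the `k*SmMajor` constants are hypothetical names used for illustration only:

```cpp
#include <cassert>
#include <cstdint>

// SM major versions: Ampere is sm_8x, Hopper is sm_90.
constexpr int32_t kAmpereSmMajor = 8;
constexpr int32_t kHopperSmMajor = 9;

// Hypothetical stand-in for the capabilities struct, without the NVML lookup.
struct ComputeCapabilitySketch {
    int32_t smMajor;
    int32_t smMinor;

    [[nodiscard]] constexpr bool IsPostAmpere() const { return smMajor >= kAmpereSmMajor; }

    [[nodiscard]] constexpr bool IsPostHopper() const { return smMajor >= kHopperSmMajor; }
};

// sm_80 (Ampere) qualifies as post-Ampere but not post-Hopper.
static_assert(ComputeCapabilitySketch{8, 0}.IsPostAmpere());
static_assert(!ComputeCapabilitySketch{8, 0}.IsPostHopper());

// sm_90 (Hopper) qualifies for both.
static_assert(ComputeCapabilitySketch{9, 0}.IsPostHopper());

// sm_75 (Turing) qualifies for neither.
static_assert(!ComputeCapabilitySketch{7, 5}.IsPostAmpere());
```

If both thresholds were set to the same major version, the two predicates would be indistinguishable and an Ampere card would be reported as post-Hopper.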
@@ -0,0 +1,146 @@
#include <fstream>

#include <fmt/ranges.h>
#include <spdlog/spdlog.h>
#include <nvml.h>

#include "backend.h"
#include "hardware.h"

void huggingface::tgi::backends::InitializeBackend() {
    SPDLOG_INFO("Initializing Backend...");
    nvmlInit_v2();
    initTrtLlmPlugins();

    const auto numGpus = huggingface::hardware::cuda::GetNumDevices();
    if (numGpus.has_value()) {
        SPDLOG_INFO("Detected {:d} Nvidia GPU(s)", numGpus.value());
    } else {
        SPDLOG_WARN("Failed to detect Nvidia GPU(s) on the system");
    }
}

[[nodiscard]]
tle::ExecutorConfig huggingface::tgi::backends::GetExecutorConfig(const json &config, const std::string &workerPath) {
    tle::ExecutorConfig execConfig(1);

    // Retrieve the compute capabilities to enable some options at runtime
    const auto computeCapabilities = huggingface::hardware::cuda::GetCudaComputeCapabilities();

    // Single engine (TP = PP = 1) -> using leader mode (no MPI involved)
    if (config["/pretrained_config/mapping/world_size"_json_pointer].get<uint8_t>() == 1) {
        SPDLOG_INFO("Detected single engine deployment, using leader mode");
        execConfig.setParallelConfig(tle::ParallelConfig(
                tle::CommunicationType::kMPI,
                tle::CommunicationMode::kLEADER,
                std::nullopt,
                std::nullopt,
                std::nullopt
        ));
    } else { // Multiple engines -> using orchestrator mode (MPI involved)
        SPDLOG_INFO("Detected sharded engine deployment, using orchestrator mode");
        execConfig.setParallelConfig(tle::ParallelConfig(
                tle::CommunicationType::kMPI,
                tle::CommunicationMode::kORCHESTRATOR,
                std::nullopt,
                std::nullopt,
                tle::OrchestratorConfig(true, workerPath, nullptr, true)
        ));
    }

    // Define some configuration variables
    execConfig.setKvCacheConfig(tle::KvCacheConfig(true));
    execConfig.setEnableChunkedContext(computeCapabilities.isPostAmpere());
    return execConfig;
}

tle::SamplingConfig huggingface::tgi::backends::GetSamplingConfig(
        uint32_t topK,
        float_t topP,
        float_t temperature,
        float_t repetition_penalty,
        float_t frequency_penalty,
        uint64_t seed) {
    return tle::SamplingConfig(
            1,  // TGI only uses a single beam
            topK,
            topP,
            std::nullopt,
            std::nullopt,
            std::nullopt,
            seed,
            temperature,
            temperature,
            std::nullopt,
            repetition_penalty,
            std::nullopt,
            frequency_penalty
    );
}

huggingface::tgi::backends::TensorRtLlmBackend::TensorRtLlmBackend(
        const std::filesystem::path &enginesFolder,
        const std::filesystem::path &executorWorker
) :
        config(json::parse(std::ifstream(enginesFolder / "config.json"))),
        executor(
                enginesFolder,
                tensorrt_llm::executor::ModelType::kDECODER_ONLY,
                GetExecutorConfig(config, executorWorker.string()
                )) {
    SPDLOG_INFO(FMT_STRING("Engine (version={})"), config["/version"_json_pointer].get_ref<const std::string &>());
}

bool huggingface::tgi::backends::TensorRtLlmBackend::IsReady() const {
    return executor.canEnqueueRequests();
}

[[nodiscard("Returned number of requests needs to be consumed")]]
size_t huggingface::tgi::backends::TensorRtLlmBackend::NumResponsesReady() const {
    return executor.getNumResponsesReady();
}

[[nodiscard("Returned request id needs to be provided back to gather generated tokens")]]
tle::IdType huggingface::tgi::backends::TensorRtLlmBackend::Submit(
        const std::vector<tle::TokenIdType> &tokens,
        const int32_t topK,
        const float_t topP,
        const float_t temperature,
        const float_t repetition_penalty,
        const float_t frequency_penalty,
        const uint64_t seed
) {
#ifdef NDEBUG
    SPDLOG_DEBUG(
            FMT_STRING("Submitting inference over {:d} tokens to the executor ({:d} already in-flight)"),
            tokens.size(),
            executor.getLatestIterationStats().back().numActiveRequests
    );
#else
    SPDLOG_DEBUG(
            FMT_STRING("Submitting inference [{}] to the executor ({:d} already in-flight)"),
            fmt::join(tokens, ", "),
            executor.getLatestIterationStats().front().numActiveRequests
    );
#endif

    const auto maxNumTokens = config["/build_config/max_num_tokens"_json_pointer].get<size_t>();
    const auto maxNewTokens = static_cast<int32_t>(maxNumTokens > tokens.size() ? maxNumTokens - tokens.size() : 1ul);

    const auto sampling = GetSamplingConfig(topK, topP, temperature, repetition_penalty, frequency_penalty, seed);
    const auto output = tle::OutputConfig(true, false, false, true, false);
    return executor.enqueueRequest(
            tle::Request{tokens, maxNewTokens, true, sampling, output});
}

[[nodiscard("Generated tokens result must be used")]]
std::vector<tle::Response> huggingface::tgi::backends::TensorRtLlmBackend::Poll(const tle::IdType requestId) {
    SPDLOG_DEBUG(FMT_STRING("Polling status for request {:d}"), requestId);
    return executor.awaitResponses(requestId);
}


void huggingface::tgi::backends::TensorRtLlmBackend::Shutdown() {
    SPDLOG_INFO("Shutting down executor");
    executor.shutdown();
}
|
|
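`Submit` derives the per-request generation budget as `max_num_tokens` minus the prompt length, with a floor of one token. Both operands are unsigned, so a prompt longer than the limit makes a plain subtraction wrap around to a huge value instead of going negative. A minimal sketch of a guarded computation; `RemainingTokenBudget` is a hypothetical helper name, not a function from this change:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of tokens that can still be generated for a
// prompt of promptLength tokens, never less than 1.
constexpr int32_t RemainingTokenBudget(std::size_t maxNumTokens, std::size_t promptLength) {
    // Compare before subtracting: maxNumTokens - promptLength would wrap
    // around when promptLength > maxNumTokens.
    const std::size_t remaining = maxNumTokens > promptLength ? maxNumTokens - promptLength : 0;
    return static_cast<int32_t>(std::max<std::size_t>(1, remaining));
}
```

For a limit of 8192 tokens and a 100-token prompt the budget is 8092; a prompt at or above the limit yields the floor of 1.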
@@ -0,0 +1,111 @@
#!/bin/bash

set -ex

TRT_VER="10.2.0.19"
CUDA_VER="12.5"
CUDNN_VER="9.2.1.18-1"
NCCL_VER="2.22.3-1+cuda12.5"
CUBLAS_VER="12.5.3.2-1"
NVRTC_VER="12.5.82-1"

for i in "$@"; do
    case $i in
        --TRT_VER=?*) TRT_VER="${i#*=}";;
        --CUDA_VER=?*) CUDA_VER="${i#*=}";;
        --CUDNN_VER=?*) CUDNN_VER="${i#*=}";;
        --NCCL_VER=?*) NCCL_VER="${i#*=}";;
        --CUBLAS_VER=?*) CUBLAS_VER="${i#*=}";;
        *) ;;
    esac
    shift
done

NVCC_VERSION_OUTPUT=$(nvcc --version)
if [[ $(echo $NVCC_VERSION_OUTPUT | grep -oP "\d+\.\d+" | head -n 1) != ${CUDA_VER} ]]; then
    echo "The version of pre-installed CUDA is not equal to ${CUDA_VER}."
    exit 1
fi

install_ubuntu_requirements() {
    apt-get update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates
    ARCH=$(uname -m)
    if [ "$ARCH" = "amd64" ];then ARCH="x86_64";fi
    if [ "$ARCH" = "aarch64" ];then ARCH="sbsa";fi
    curl -fsSLO https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/${ARCH}/cuda-keyring_1.0-1_all.deb
    dpkg -i cuda-keyring_1.0-1_all.deb

    apt-get update
    if [[ $(apt list --installed | grep libcudnn9) ]]; then
        apt-get remove --purge -y --allow-change-held-packages libcudnn9*
    fi
    if [[ $(apt list --installed | grep libnccl) ]]; then
        apt-get remove --purge -y --allow-change-held-packages libnccl*
    fi
    if [[ $(apt list --installed | grep libcublas) ]]; then
        apt-get remove --purge -y --allow-change-held-packages libcublas*
    fi
    if [[ $(apt list --installed | grep cuda-nvrtc-dev) ]]; then
        apt-get remove --purge -y --allow-change-held-packages cuda-nvrtc-dev*
    fi
    CUBLAS_CUDA_VERSION=$(echo $CUDA_VER | sed 's/\./-/g')
    apt-get install -y --no-install-recommends libcudnn9-cuda-12=${CUDNN_VER} libcudnn9-dev-cuda-12=${CUDNN_VER}
    apt-get install -y --no-install-recommends libnccl2=${NCCL_VER} libnccl-dev=${NCCL_VER}
    apt-get install -y --no-install-recommends libcublas-${CUBLAS_CUDA_VERSION}=${CUBLAS_VER} libcublas-dev-${CUBLAS_CUDA_VERSION}=${CUBLAS_VER}
    # NVRTC static library doesn't exist in NGC PyTorch container.
    NVRTC_CUDA_VERSION=$(echo $CUDA_VER | sed 's/\./-/g')
    apt-get install -y --no-install-recommends cuda-nvrtc-dev-${NVRTC_CUDA_VERSION}=${NVRTC_VER}
    apt-get clean
    rm -rf /var/lib/apt/lists/*
}

install_centos_requirements() {
    CUBLAS_CUDA_VERSION=$(echo $CUDA_VER | sed 's/\./-/g')
    yum -y update
    yum -y install epel-release
    yum remove -y libnccl* && yum -y install libnccl-${NCCL_VER} libnccl-devel-${NCCL_VER}
    yum remove -y libcublas* && yum -y install libcublas-${CUBLAS_CUDA_VERSION}-${CUBLAS_VER} libcublas-devel-${CUBLAS_CUDA_VERSION}-${CUBLAS_VER}
    yum clean all
}

install_tensorrt() {
    #PY_VERSION=$(python3 -c 'import sys; print(".".join(map(str, sys.version_info[0:2])))')
    #PARSED_PY_VERSION=$(echo "${PY_VERSION//./}")
    TRT_CUDA_VERSION="12.5"

    if [ -z "$RELEASE_URL_TRT" ];then
        ARCH=${TRT_TARGETARCH}
        if [ -z "$ARCH" ];then ARCH=$(uname -m);fi
        if [ "$ARCH" = "arm64" ];then ARCH="aarch64";fi
        if [ "$ARCH" = "amd64" ];then ARCH="x86_64";fi
        if [ "$ARCH" = "x86_64" ];then DIR_NAME="x64-agnostic"; else DIR_NAME=${ARCH};fi
        if [ "$ARCH" = "aarch64" ];then OS1="Ubuntu22_04" && OS2="Ubuntu-22.04" && OS="ubuntu-22.04"; else OS1="Linux" && OS2="Linux" && OS="linux";fi
        RELEASE_URL_TRT=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.2.0/tars/TensorRT-${TRT_VER}.${OS2}.${ARCH}-gnu.cuda-${TRT_CUDA_VERSION}.tar.gz
    fi
    wget --no-verbose ${RELEASE_URL_TRT} -O /tmp/TensorRT.tar
    tar -xf /tmp/TensorRT.tar -C /usr/local/
    mv /usr/local/TensorRT-${TRT_VER} /usr/local/tensorrt
    # pip3 install /usr/local/tensorrt/python/tensorrt-*-cp${PARSED_PY_VERSION}-*.whl
    rm -rf /tmp/TensorRT.tar
}

# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
    debian)
        install_ubuntu_requirements
        install_tensorrt
        ;;
    ubuntu)
        install_ubuntu_requirements
        install_tensorrt
        ;;
    centos)
        install_centos_requirements
        install_tensorrt
        ;;
    *)
        echo "Unable to determine OS..."
        exit 1
        ;;
esac
|
|
@@ -0,0 +1,329 @@
use std::future::Future;
use std::path::Path;
use std::pin::{pin, Pin};
use std::str::FromStr;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, OnceLock};
use std::task::{Context, Poll};
use std::time::Duration;

use async_trait::async_trait;
use cxx::UniquePtr;
use log::{error, warn};
use tokenizers::Tokenizer;
use tokio::sync::mpsc::{unbounded_channel, UnboundedSender};
use tokio::sync::RwLock;
use tokio::time::{sleep, Instant};
use tokio_stream::wrappers::UnboundedReceiverStream;
use tokio_stream::{Stream, StreamExt};
use tracing::{instrument, span, Level};

use text_generation_router::infer::{Backend, GeneratedText, InferError, InferStreamResponse};
use text_generation_router::validation::ValidationError::UnsupportedModality;
use text_generation_router::validation::{Chunk, ValidGenerateRequest, ValidationError};
use text_generation_router::{FinishReason, Token};

use crate::errors::TensorRtLlmBackendError;
use crate::ffi::{create_tensorrt_llm_backend, GenerationStep, TensorRtLlmBackendImpl};

// Interval, in microseconds, at which the state of the generation stream is polled
static POLLING_INTERVAL_US: OnceLock<u64> = OnceLock::new();

type InferResult<T> = Result<T, InferError>;

pub(crate) struct Generation {
    executor: Arc<RwLock<UniquePtr<TensorRtLlmBackendImpl>>>,
    done: Arc<AtomicBool>,
}

/// Holds the user provided input to be executed along with a channel allowing
/// to bubble up all the generated tokens for that request to the end stream.
pub struct GenerationContext {
    sender: UnboundedSender<InferResult<InferStreamResponse>>,
    tokenizer: Arc<Tokenizer>,
    tokens: Vec<u32>,
    done: Arc<AtomicBool>,
    queued: Instant,
    start: Option<Instant>,
}

impl Stream for Generation {
    type Item = usize;

    fn poll_next(self: Pin<&mut Self>, ctx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let interval = POLLING_INTERVAL_US.get_or_init(|| {
            u64::from_str(option_env!("TRTLLM_BACKEND_POLLING_INTERVAL_US").unwrap_or("100"))
                .expect("Invalid value provided for envvar TRTLLM_BACKEND_POLLING_INTERVAL_US")
        });
|
||||
|