Commit Graph

836 Commits

Author SHA1 Message Date
Morgan Funtowicz e983ee5bb8 make sure the context is not dropped in the middle of the async decoding. 2024-07-17 21:56:50 +00:00
Morgan Funtowicz 9220340ff7 compute the number of maximum new tokens for each request independently 2024-07-17 13:55:29 +00:00
Morgan Funtowicz a01cd030d4 oops missing c++ backend definitions 2024-07-16 20:11:59 +00:00
Morgan Funtowicz 7784a21d48 impl RwLock scenario for TensorRtLlmBackend 2024-07-16 20:08:10 +00:00
Morgan Funtowicz 31d9f4d5dc expose shutdown function at ffi layer 2024-07-15 07:36:01 +00:00
Morgan Funtowicz b291be64a0 impl the rust backend which currently cannot move the actual computation to a background thread 2024-07-12 19:26:32 +00:00
Morgan Funtowicz 518d9a9e0b make sure to track include/ffi.h to trigger rebuild from cargo 2024-07-12 19:26:04 +00:00
Morgan Funtowicz 344f33f398 end to end ffi flow working 2024-07-12 19:25:40 +00:00
Morgan Funtowicz b846ae2d9e use external fmt lib 2024-07-12 19:24:59 +00:00
Morgan Funtowicz 1972669f49 remove fmt import 2024-07-12 19:24:09 +00:00
Morgan Funtowicz 50e9fc89c8 working setup of the ffi layer 2024-07-11 21:24:32 +00:00
Morgan Funtowicz 5aede911f8 include guard to build example in cmakelists 2024-07-11 21:24:01 +00:00
Morgan Funtowicz ed14bd6818 use correct include for spdlog 2024-07-10 13:57:31 +00:00
Morgan Funtowicz 42748d5960 allow converting huggingface::tokenizers error to TensorRtLlmBackendError 2024-07-10 13:56:57 +00:00
Morgan Funtowicz 40fe2ec0ff add auth_token CLI argument to provide hf hub authentication token 2024-07-10 13:50:28 +00:00
Morgan Funtowicz ca9da2dd49 create cmake install target to put everything relevant in installation folder 2024-07-10 13:48:59 +00:00
Morgan Funtowicz 4272b8cf51 correctly tell cmake to build dependent tensorrt-llm required libraries 2024-07-10 13:48:44 +00:00
Morgan Funtowicz 6c92ebe6a8 update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c 2024-07-10 13:47:56 +00:00
Morgan Funtowicz 7b9f92a0aa use spdlog release 1.14.1 moving forward 2024-07-10 13:47:31 +00:00
Morgan Funtowicz 13eabfabcb implement the Stream method to send new tokens through a callback 2024-07-09 13:46:48 +00:00
Morgan Funtowicz 09292b06a0 updated logic and comment to detect cuda compute capabilities 2024-07-09 12:15:41 +00:00
Morgan Funtowicz bec188ff73 bind to CUDA::nvml to retrieve compute capabilities at runtime 2024-07-08 22:32:41 +00:00
Morgan Funtowicz 68a0247a2c unconditionally call InitializeBackend on the FFI layer 2024-07-08 22:09:09 +00:00
Morgan Funtowicz da926feaa1 make leader executor mode work 2024-07-08 22:08:49 +00:00
Morgan Funtowicz f53ffa886d Specify which default log level to use depending on CMake build type 2024-07-08 22:06:49 +00:00
Morgan Funtowicz 4113d6d51b Move to latest TensorRT-LLM version 2024-07-08 22:06:30 +00:00
Morgan Funtowicz 29c7cb36e5 Reminder: check how we can detect support for chunked context 2024-07-03 21:38:17 +00:00
Morgan Funtowicz f57f2a4521 First version loading engines and making it ready for inference 2024-07-03 21:12:24 +00:00
Morgan Funtowicz f8a1463915 Enable end to end CMake build 2024-07-03 10:27:53 +02:00
Morgan Funtowicz 818162e0c2 Overall build of TRTLLM and deps through the CMake build system 2024-07-02 17:16:27 +02:00
Morgan Funtowicz 6dc98abe46 Remove unused parameters and force tokenizer name to be set 2024-07-01 16:11:59 +02:00
Morgan Funtowicz 47ac5c654d Working FFI call for TGI and TRTLLM backend 2024-07-01 15:53:23 +02:00
Morgan Funtowicz dc402dc9ac Initial setup for CXX binding to TRTLLM 2024-06-30 23:37:20 +02:00
OlivierDehaene 230f2a415a refacto 2024-06-26 14:12:01 +02:00
OlivierDehaene 93e0a7de8b refacto 2024-06-26 14:00:03 +02:00
OlivierDehaene b562680be4 wip 2024-06-26 13:13:32 +02:00
OlivierDehaene 504754861f wip 2024-06-26 12:08:56 +02:00
drbh be2d38032a
fix: simplify kserve endpoint and fix imports (#2119) 2024-06-25 19:30:10 -04:00
Daniël de Kok f1f98e369f
Add support for Marlin 2:4 sparsity (#2102)
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:

* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
2024-06-25 21:09:42 +02:00
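For context, the kernel-selection rule this commit describes reduces to a two-condition dispatch. A minimal sketch of that rule (function and string names below are illustrative, not the actual TGI code):

```python
def select_marlin_kernel(quantizer: str, checkpoint_format: str) -> str:
    """Pick a Marlin kernel family per the rule in the commit message."""
    if quantizer == "marlin" and checkpoint_format == "marlin_24":
        return "marlin_2_4_sparse"  # 2:4 semi-structured sparsity kernel
    if quantizer == "marlin":
        return "marlin_dense"       # regular dense Marlin kernel
    return "default"
```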
Daniël de Kok 14980df2df
Support AWQ quantization with bias (#2117)
When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.

Correctly pass through the bias when it is not `None`.

Fixes #2106.
2024-06-25 21:09:00 +02:00
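The bug class fixed here is worth a small illustration: a boolean bias *flag* leaking into arithmetic where a bias *tensor* belongs. A minimal sketch (illustrative, not the actual AWQ layer code):

```python
import torch
import torch.nn.functional as F

def linear_buggy(x, weight, bias_flag=True):
    # Bug: `True` is coerced to 1.0 and broadcast-added to every output,
    # instead of adding the real bias tensor.
    return F.linear(x, weight) + bias_flag

def linear_fixed(x, weight, bias=None):
    # Fix: pass the bias tensor through; F.linear accepts None for no bias.
    return F.linear(x, weight, bias)
```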
drbh 04e1af94d7
Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support for vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-06-25 14:46:27 -04:00
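With the launcher and docs changes in this PR, a single deployment can serve several adapters, selected per request. A hedged sketch of what a client call looks like (the field name follows the lora docs added in this PR, but the adapter id is hypothetical and details should be checked against your TGI version):

```python
import requests

resp = requests.post(
    "http://localhost:3000/generate",
    json={
        "inputs": "What is Deep Learning?",
        "parameters": {
            "max_new_tokens": 40,
            "adapter_id": "my-org/my-lora-adapter",  # hypothetical adapter id
        },
    },
)
print(resp.json()["generated_text"])
```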
Nicolas Patry a2a97b05d6
Fix CI. (#2118)
Fix clippy.
2024-06-25 17:53:36 +02:00
Daniël de Kok fc9c3153e5
Add pytest release marker (#2114)
* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
2024-06-25 16:53:20 +02:00
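The marker gating described above is typically wired up in `conftest.py` with a CLI option plus a collection hook. A sketch of one common way to do it (TGI's actual implementation may differ):

```python
# conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--release", action="store_true", default=False,
        help="also run tests marked with @pytest.mark.release",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--release"):
        return  # --release given: run everything, including release tests
    skip_release = pytest.mark.skip(reason="needs --release to run")
    for item in items:
        if "release" in item.keywords:
            item.add_marker(skip_release)
```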
Wang, Yi e563983d90
fix cpu and xpu issue (#2116)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-06-25 16:47:06 +02:00
Nicolas Patry 9e2fdf57c0
Removing IPEX_AVAIL. (#2115)
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most code is exactly the same
except for a few spots.

Most of those spots are in the kv-cache layout and the flash_xxx.py
files. Since those files should be removed soon and factored away, we
should not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
2024-06-25 13:20:57 +02:00
drbh 3f3b7ffd67
feat: add simple tests for weights (#2092)
* feat: add simple tests for weights

* fix: adjust types and add tests

* fix: adjust so all tests pass

* feat: improve weight tests

* fix: add missing tests and renames

* fix: tweak shapes
2024-06-25 12:22:59 +02:00
Wang, Yi b64c70c9e7
Cpu tgi (#1936)
* add CPU tgi support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2024-06-25 12:21:29 +02:00
sunxichen b69f078041
fix ChatCompletion and ChatCompletionChunk `object` string not being compatible with the standard OpenAI API (#2089)
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
2024-06-25 10:59:50 +02:00
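For reference, the standard OpenAI API uses fixed literal strings in the `object` field, which is what this fix aligns TGI's responses with (minimal illustrative payloads):

```python
chat_completion = {
    "object": "chat.completion",        # non-streaming responses
    "choices": [{"message": {"role": "assistant", "content": "Hi"}}],
}
chat_completion_chunk = {
    "object": "chat.completion.chunk",  # streaming chunks
    "choices": [{"delta": {"content": "Hi"}}],
}
```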
Wang, Yi 83634dc122
use xpu-smi to dump used memory (#2047)
* use xpu-smi to dump used memory
xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-06-25 10:15:46 +02:00
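Since the commit notes that `ZE_AFFINITY_MASK` plays the role `CUDA_VISIBLE_DEVICES` plays on NVIDIA, pinning a server to one XPU card looks roughly like this (a sketch; the device index and model id are illustrative):

```python
import os
import subprocess

env = dict(os.environ, ZE_AFFINITY_MASK="0")  # expose only XPU card 0
subprocess.run(
    ["text-generation-launcher", "--model-id", "my-model"],
    env=env,
)
```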
Jeff 5b2155b0f8
corrected Pydantic warning. (#2095)
* corrected Pydantic warning.

* Update clients/python/text_generation/types.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-06-25 10:10:32 +02:00