821 Commits

Author SHA1 Message Date
Morgan Funtowicz ca9da2dd49 create cmake install target to put everything relevant in installation folder 2024-07-10 13:48:59 +00:00
Morgan Funtowicz 4272b8cf51 correctly tell cmake to build dependent tensorrt-llm required libraries 2024-07-10 13:48:44 +00:00
Morgan Funtowicz 6c92ebe6a8 update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c 2024-07-10 13:47:56 +00:00
Morgan Funtowicz 7b9f92a0aa use spdlog release 1.14.1 moving forward 2024-07-10 13:47:31 +00:00
Morgan Funtowicz 13eabfabcb implement the Stream method to send new tokens through a callback 2024-07-09 13:46:48 +00:00
Morgan Funtowicz 09292b06a0 updated logic and comment to detect cuda compute capabilities 2024-07-09 12:15:41 +00:00
Morgan Funtowicz bec188ff73 bind to CUDA::nvml to retrieve compute capabilities at runtime 2024-07-08 22:32:41 +00:00
Morgan Funtowicz 68a0247a2c unconditionally call InitializeBackend on the FFI layer 2024-07-08 22:09:09 +00:00
Morgan Funtowicz da926feaa1 make leader executor mode working 2024-07-08 22:08:49 +00:00
Morgan Funtowicz f53ffa886d Specify which default log level to use depending on CMake build type 2024-07-08 22:06:49 +00:00
Morgan Funtowicz 4113d6d51b Move to latest TensorRT-LLM version 2024-07-08 22:06:30 +00:00
Morgan Funtowicz 29c7cb36e5 Remembering to check how we can detect support for chunked context 2024-07-03 21:38:17 +00:00
Morgan Funtowicz f57f2a4521 First version loading engines and making it ready for inference 2024-07-03 21:12:24 +00:00
Morgan Funtowicz f8a1463915 Enable end to end CMake build 2024-07-03 10:27:53 +02:00
Morgan Funtowicz 818162e0c2 Overall build TRTLLM and deps through CMake build system 2024-07-02 17:16:27 +02:00
Morgan Funtowicz 6dc98abe46 Remove unused parameters and force tokenizer name to be set 2024-07-01 16:11:59 +02:00
Morgan Funtowicz 47ac5c654d Working FFI call for TGI and TRTLLM backend 2024-07-01 15:53:23 +02:00
Morgan Funtowicz dc402dc9ac Initial setup for CXX binding to TRTLLM 2024-06-30 23:37:20 +02:00
OlivierDehaene 230f2a415a refacto 2024-06-26 14:12:01 +02:00
OlivierDehaene 93e0a7de8b refacto 2024-06-26 14:00:03 +02:00
OlivierDehaene b562680be4 wip 2024-06-26 13:13:32 +02:00
OlivierDehaene 504754861f wip 2024-06-26 12:08:56 +02:00
drbh be2d38032a
fix: simplify kserve endpoint and fix imports (#2119) 2024-06-25 19:30:10 -04:00
Daniël de Kok f1f98e369f
Add support for Marlin 2:4 sparsity (#2102)
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:

* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
2024-06-25 21:09:42 +02:00
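
The two conditions listed in the commit above (#2102) amount to a simple conjunction. A minimal sketch, assuming a hypothetical helper name and plain string arguments; the actual function and field names in the repository may differ:

```python
def use_marlin_24_kernel(quantizer: str, checkpoint_format: str) -> bool:
    """Return True when the 2:4 sparse Marlin kernel should be selected.

    Hypothetical helper: both conditions from the commit message must hold.
    """
    return quantizer == "marlin" and checkpoint_format == "marlin_24"


# Only the combination of both settings selects the 2:4 kernel.
print(use_marlin_24_kernel("marlin", "marlin_24"))
print(use_marlin_24_kernel("marlin", "marlin"))
```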
Daniël de Kok 14980df2df
Support AWQ quantization with bias (#2117)
When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.

Correctly pass through the bias when it is not `None`.

Fixes #2106.
2024-06-25 21:09:00 +02:00
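
The failure mode described in the commit above (#2117) can be reproduced in a few lines. This is an illustrative sketch using plain Python lists; `linear` and `linear_buggy` are hypothetical names, not the repository's AWQ layer:

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def linear_buggy(x, weight, bias=None):
    """Mimics the bug: a truthy flag is added instead of the bias tensor."""
    has_bias = bias is not None
    # In arithmetic, True behaves as 1, so every output is shifted by 1.0.
    return [o + has_bias for o in matvec(weight, x)]


def linear(x, weight, bias=None):
    """Fixed version: pass the bias through when it is not None."""
    out = matvec(weight, x)
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


x = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -0.5]
print(linear_buggy(x, weight, bias))
print(linear(x, weight, bias))
```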
drbh 04e1af94d7
Enable multiple LoRA adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-06-25 14:46:27 -04:00
Nicolas Patry a2a97b05d6
Fix CI. (#2118)
Fix clippy.
2024-06-25 17:53:36 +02:00
Daniël de Kok fc9c3153e5
Add pytest release marker (#2114)
* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
2024-06-25 16:53:20 +02:00
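
One way a `--release` gate like the one described in the commit above (#2114) can be wired up is through `conftest.py` collection hooks. The sketch below is an assumption about the mechanism, not the repository's actual `conftest.py`; `is_selected` is a hypothetical helper, and tests are deselected rather than skipped:

```python
def pytest_addoption(parser):
    # Register the command-line flag used to opt in to release tests.
    parser.addoption(
        "--release", action="store_true", default=False, help="run release tests"
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "release: only run with --release")


def is_selected(marker_names, release_enabled):
    """A test runs unless it is marked `release` and --release is absent."""
    return release_enabled or "release" not in marker_names


def pytest_collection_modifyitems(config, items):
    release_enabled = config.getoption("--release")
    selected, deselected = [], []
    for item in items:
        names = {marker.name for marker in item.iter_markers()}
        target = selected if is_selected(names, release_enabled) else deselected
        target.append(item)
    if deselected:
        # Report the dropped tests and keep only the selected ones.
        config.hook.pytest_deselected(items=deselected)
        items[:] = selected
```

With hooks like these in place, a test annotated with `@pytest.mark.release` would be collected only when `--release` is passed on the command line.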
Wang, Yi e563983d90
fix cpu and xpu issue (#2116)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-06-25 16:47:06 +02:00
Nicolas Patry 9e2fdf57c0
Removing IPEX_AVAIL. (#2115)
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most code is identical
except for a few spots.

The biggest number of spots is the kv-cache layout and the flash_xxx.py
files.
Since those files should be removed soon and factored away, we should
not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
2024-06-25 13:20:57 +02:00
drbh 3f3b7ffd67
feat: add simple tests for weights (#2092)
* feat: add simple tests for weights

* fix: adjust types and add tests

* fix: adjust so all tests pass

* feat: improve weight tests

* fix: add missing tests and renames

* fix: tweak shapes
2024-06-25 12:22:59 +02:00
Wang, Yi b64c70c9e7
Cpu tgi (#1936)
* add CPU tgi support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2024-06-25 12:21:29 +02:00
sunxichen b69f078041
fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089)
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
2024-06-25 10:59:50 +02:00
Wang, Yi 83634dc122
use xpu-smi to dump used memory (#2047)
* use xpu-smi to dump used memory
XPU uses `ZE_AFFINITY_MASK` to select cards, much like `CUDA_VISIBLE_DEVICES`

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-06-25 10:15:46 +02:00
Jeff 5b2155b0f8
corrected Pydantic warning. (#2095)
* corrected Pydantic warning.

* Update clients/python/text_generation/types.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-06-25 10:10:32 +02:00
KevinDuffy94 1869ee2f57
Add OTLP Service Name Environment Variable (#2076)
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option
2024-06-25 09:33:01 +02:00
Lucain 3447c722fd
Support `HF_TOKEN` environment variable (#2066)
* Support HF_TOKEN environment variable

* Load test.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-06-25 09:23:12 +02:00
ur4t 405765b18c
Fix cargo-chef prepare (#2101)
* Fix cargo-chef prepare

In prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly.
If Cargo.lock is not present, cargo-chef will generate a new one first, which
might vary a lot and invalidate docker build caches.

* Fix Dockerfile_amd and Dockerfile_intel
2024-06-24 18:16:36 +02:00
Nicolas Patry 480d3b3304
New runner. Manual squash. (#2110)
* New runner. Manual squash.

* Network host.

* Put back trufflehog with proper extension.

* No network host ?

* Moving buildx install after tailscale ?

* 1.79
2024-06-24 18:08:34 +02:00
drbh 811a9381b1
feat: sort cuda graphs in descending order (#2104) 2024-06-21 14:28:26 -04:00
Daniël de Kok 197c47a302
Fix `text-generation-server quantize` (#2103)
The subcommand did not work due to some broken imports.
2024-06-21 15:28:51 +02:00
Daniël de Kok bcb3faa1c2
Factor out sharding of packed tensors (#2059)
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
2024-06-20 09:56:04 +02:00
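
To illustrate why a packed tensor needs its own sharding routine, as the commit above (#2059) describes: a fused QKV bias cannot simply be cut into contiguous chunks, because each rank needs its slice of Q, of K, and of V. The following is a minimal sketch on a flat Python list; `shard_packed` and its parameters are hypothetical and are not the signature of `Weights.get_packed_sharded`:

```python
def shard_packed(values, block_sizes, rank, world_size):
    """Shard a 1-D packed tensor (e.g. a fused Q/K/V bias) across ranks.

    Each packed block is split evenly, and this rank's slice of every
    block is concatenated in block order.
    """
    if sum(block_sizes) != len(values):
        raise ValueError("block sizes must cover the whole tensor")
    shard = []
    offset = 0
    for size in block_sizes:
        if size % world_size != 0:
            raise ValueError("each block must divide evenly across ranks")
        per_rank = size // world_size
        start = offset + rank * per_rank
        shard.extend(values[start:start + per_rank])
        offset += size
    return shard


# Packed layout: Q = [0, 1, 2, 3], K = [4, 5], V = [6, 7]
packed = list(range(8))
print(shard_packed(packed, [4, 2, 2], rank=0, world_size=2))  # [0, 1, 4, 6]
print(shard_packed(packed, [4, 2, 2], rank=1, world_size=2))  # [2, 3, 5, 7]
```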
Daniël de Kok f5a9837592
Support exl2-quantized Qwen2 models (#2085)
Fixes #2081.
2024-06-20 07:56:16 +02:00
drbh cdbf802860
feat: rotate tests ci token (#2091) 2024-06-19 17:02:58 -04:00
Daniël de Kok 11ea9ce002
CI: pass pre-commit hooks again (#2084) 2024-06-18 09:38:21 +02:00
Guillaume LEGENDRE 4f25c67d63
CI: Tailscale improvements (#2079)
* test local tailscale

* Update build.yaml

* Update build.yaml

* Update build.yaml

* Update build.yaml

* wait for ssh

* network host

* change step order
2024-06-18 09:13:04 +02:00
Daniël de Kok c8c7ccd31e
Set maximum grpc message receive size to 2GiB (#2075)
* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which doesn't really work well for
multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
2024-06-17 16:40:44 +02:00
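
For scale, the commit above (#2075) raises the receive limit from 4 MiB to roughly 2 GiB. A hedged sketch of how such a limit is commonly expressed on the Python side: `grpc.max_receive_message_length` is a standard gRPC channel argument, but the constant names and the choice of `(1 << 31) - 1` (the largest signed 32-bit value, one byte short of 2 GiB) are assumptions here, not taken from the commit:

```python
# Previous default mentioned in the commit message: 4 MiB.
OLD_DEFAULT_BYTES = 4 * 1024 * 1024

# Assumed new limit: the largest signed 32-bit value, just under 2 GiB.
MAX_RECEIVE_BYTES = (1 << 31) - 1

# Options in the (key, value) form accepted by e.g. grpc.aio.server(options=...).
GRPC_OPTIONS = [("grpc.max_receive_message_length", MAX_RECEIVE_BYTES)]


def fits(message_bytes: int, limit: int) -> bool:
    """Return True when a message of the given size is within the limit."""
    return message_bytes <= limit


# A 50 MiB multi-modal payload exceeds the old default but not the new limit.
payload = 50 * 1024 * 1024
print(fits(payload, OLD_DEFAULT_BYTES), fits(payload, MAX_RECEIVE_BYTES))
```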
Ziru Niu 0f7d38e774
fix build.rs watch files (#2072) 2024-06-17 12:10:01 +02:00
Lysandre Debut 131838919e
Contributing guide & Code of Conduct (#2074)
* Contributing guide & Code of Conduct

* Redirect to GitHub's tutorial on PRs
2024-06-17 12:09:31 +02:00
Daniël de Kok e903770897
Support different image sizes in prefill in VLMs (#2065)
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-06-17 10:49:41 +02:00
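
The idea in the commit above (#2065), processing the whole batch together so every image tensor ends up with a compatible shape, can be illustrated with padding. This sketch is an assumption: the real image processor may resize or tile rather than pad, and `pad_batch` is a hypothetical helper that treats each image as a 2-D list of pixel values:

```python
def pad_batch(images, fill=0.0):
    """Pad every image to the largest height and width found in the batch."""
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    padded = []
    for img in images:
        # Extend each row to the common width, then add rows to reach the common height.
        rows = [list(row) + [fill] * (max_w - len(row)) for row in img]
        rows += [[fill] * max_w for _ in range(max_h - len(img))]
        padded.append(rows)
    return padded


# A 2x2 image and a 1x3 image cannot be stacked until both are 2x3.
batch = pad_batch([[[1, 2], [3, 4]], [[5, 6, 7]]])
print([(len(img), len(img[0])) for img in batch])  # [(2, 3), (2, 3)]
```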