Commit Graph

1118 Commits

Author SHA1 Message Date
Morgan Funtowicz 188e4dc64f (misc) build for sm_{75,80,86,89,90} by default 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 544c9d9dba (fix) HOPPER_SM_MAJOR is 9, not 8 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 213acc6e34 (misc) move to latest trtllm 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 507ff66692 (misc) rerun-if-changed all the cmake modules 2024-10-21 10:00:27 +02:00
Morgan Funtowicz b242f45c04 (misc) delete backend.rs 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 984ae9798f (post) impl postprocessing 2024-10-21 10:00:27 +02:00
Morgan Funtowicz fa63db0d07 (scheduler) rework submit/pull logic 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 42ccf4e77c (misc) no need to move for uint32_t items 2024-10-21 10:00:27 +02:00
Morgan Funtowicz b41875c139 (misc) simplify [make_]move_iterator by using c++20 type inference 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 0f50539b77 (Dockerfile.trtllm) delete for now 2024-10-21 10:00:27 +02:00
Morgan Funtowicz b1846fb4e6 (backend) refactor & cleanup 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 483f172938 (ffi) do not use reference capture in lambda as we are not capturing anything 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 3d0e90b631 (ffi) missing namespace for tle::Response 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 8e648ce425 (ffi) fix usage of wrong vector constructor making a capacity fill call 2024-10-21 10:00:27 +02:00
Morgan Funtowicz dddc9a44bd (build) FetchContent: use archives instead of git 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 089c5fe668 (server) forward auth_token to server::run 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 291eaa99fb use blocking_recv in looper to consume as many awaiting_requests as possible before pulling in a single step 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 7bebc629af (misc) missing Result types for Rust 2024-10-21 10:00:27 +02:00
Morgan Funtowicz c2e21d8725 (backend) implement the post_processor background thread 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 0dca168bcb (misc) change scope identifiers 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 933ab67aa1 (ffi) encode the provided user prompt within each request thread 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 0b0c30fe8b (ffi) remove narrowing type warning 2024-10-21 10:00:27 +02:00
Morgan Funtowicz fb759bdd2a (looper) new looper initial implementation 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 5f7c0b67c3 (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException> 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 33c962ef41 (ffi) add missing headers imports 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 2883c042ed (ffi) cleanup again 2024-10-21 10:00:27 +02:00
Morgan Funtowicz f4a74be384 (backend) expose PullNewTokens 2024-10-21 10:00:27 +02:00
Morgan Funtowicz b8a40a0af3 (backend) cleanup a bit 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 38b5263c61 (ffi) add max_new_tokens parameters 2024-10-21 10:00:27 +02:00
Morgan Funtowicz f6f689f509 (build) setup ccache if available 2024-10-21 10:00:27 +02:00
Morgan Funtowicz 2a339f99dd (trt) 2024-10-21 10:00:25 +02:00
Morgan Funtowicz 169e1f452f (server) expose new SchedulingError 2024-10-21 10:00:04 +02:00
Morgan Funtowicz 0cd7538a48 (ffi) use const for GetSamplingConfig 2024-10-21 09:57:26 +02:00
Morgan Funtowicz cea64e234f (chore) fmt ... why? 2024-10-21 09:57:26 +02:00
Morgan Funtowicz a3f7d76f7b (launcher) default new server::run parameters to false for now 2024-10-21 09:57:24 +02:00
Morgan Funtowicz 25b20cba2a (backend) use parking_lot crate for RwLock fairness
# Conflicts:
#	backends/trtllm/src/backend.rs
2024-10-21 09:57:16 +02:00
Daniël de Kok 5e0fb46821
Make handling of FP8 scales more consistent (#2666)
Change `fp8_quantize` so that we can pass around reciprocals everywhere,
so scales are always passed around in the checkpoint format.

I also noticed that we were ignoring any input scales we might have when
fbgemm is available; this path is now skipped if we already have a scale.
2024-10-19 09:05:01 +02:00
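A minimal sketch of the scale convention described in the commit above, assuming per-tensor scaling; this `fp8_quantize` is illustrative, not TGI's exact signature. Scales are kept in the checkpoint format, i.e. the dequantization scale, and quantization uses its reciprocal:

    from typing import Optional

    import torch

    def fp8_quantize(weight: torch.Tensor, scale: Optional[torch.Tensor] = None):
        finfo = torch.finfo(torch.float8_e4m3fn)
        if scale is None:
            # Per-tensor dequantization scale: amax / fp8_max.
            scale = weight.abs().max().clamp(min=1e-12) / finfo.max
        # Quantize with the reciprocal; dequantization is qweight * scale.
        qweight = (weight * scale.reciprocal()).clamp(finfo.min, finfo.max)
        return qweight.to(torch.float8_e4m3fn), scale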
Nicolas Patry 153ff3740b
CI job: GPTQ/AWQ int4 (#2665)
* add GPTQ and AWQ int4 support on the Intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ci failure

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* set kv cache dtype

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine the code according to the review comments

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Simplifying conditionals + reverting integration tests values.

* Unused import

* Fix redundant import.

* Revert change after rebase.

* Upgrading the tests (the TP>1 fix changes them to use different kernels).

* Update server/text_generation_server/layers/gptq/__init__.py

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-18 17:55:53 +02:00
Daniël de Kok 8ec57558cd
Break cycle between the attention implementations and KV cache (#2627) 2024-10-17 14:54:22 +02:00
drbh 5f32dea1e2
fix: prefer inplace softmax to avoid copy (#2661)
* fix: prefer inplace softmax to avoid copy

* Update server/text_generation_server/models/flash_causal_lm.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-17 08:49:02 -04:00
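For illustration only (the actual change is in flash_causal_lm.py and may differ), a minimal in-place softmax in Python: every step mutates the tensor's own storage, so no extra copy is allocated.

    import torch

    def softmax_(x: torch.Tensor) -> torch.Tensor:
        # Numerically stable softmax computed entirely in place.
        x -= x.amax(dim=-1, keepdim=True)
        x.exp_()
        x /= x.sum(dim=-1, keepdim=True)
        return x

    logits = torch.randn(4, 32000)
    probs = softmax_(logits)  # `probs` aliases `logits`; no new allocation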
oOraph 1b97e084bf
fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process (#2663)
tgi-entrypoint: exec instead of spawning a child process

Reason: otherwise the parent receives the signals when we'd like TGI to
receive them. Keeping the parent/child split is not necessary, and it would
require the parent to handle signals in order to forward them properly to
the child.

Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>
2024-10-17 11:15:26 +02:00
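Sketched in Python for illustration (the actual entrypoint is a shell script using `exec`): replacing the wrapper process keeps the launcher as PID 1 in the container, so signals such as SIGTERM from `docker stop` reach it directly.

    import os
    import sys

    # os.execvp never returns: the wrapper's process image is replaced,
    # so no parent is left behind to intercept signals.
    os.execvp(sys.argv[1], sys.argv[1:])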
Daniël de Kok 59ea38cbca
Simplify the `attention` function (#2609)
* Simplify the `attention` function

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).

* Fixup flashinfer support
2024-10-17 10:42:52 +02:00
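A toy sketch of the kwargs-only shape described in the commit above (names and shapes are illustrative, not TGI's actual `attention` signature): keyword-only arguments make it impossible to swap the various `Tensor` parameters by position.

    import torch

    def attention(*, query, key, value, softmax_scale):
        # The bare `*` makes every argument keyword-only.
        scores = torch.einsum("qhd,khd->hqk", query, key) * softmax_scale
        return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), value)

    q = k = v = torch.randn(5, 2, 8)  # (seq_len, num_heads, head_dim)
    out = attention(query=q, key=k, value=v, softmax_scale=8 ** -0.5)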
Daniël de Kok 5bbe1ce028
Support `e4m3fn` KV cache (#2655)
* Support `e4m3fn` KV cache

* Make check more obvious
2024-10-17 10:42:16 +02:00
OlivierDehaene a6a0c97ed9
feat: prefill chunking (#2600)
* wip

* rollback

* refactor to use prefix/postfix naming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot of the time).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-16 12:49:33 +02:00
Mohit Sharma 704a58c807
Fp8 e4m3_fnuz support for rocm (#2588)
* (feat) fp8 fnuz support for rocm

* (review comments) Fix compression_config load, type hints

* (bug) update all has_tensor

* (review_comments) fix typo and added comments

* (nit) improved comment
2024-10-16 09:54:50 +02:00
Alvaro Bartolome ffe05ccd05
Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` (#2651)
As spotted by @philschmid, the payload was compliant with Vertex AI, but
only partially: the most compliant version flattens the generation kwargs
to the same level as the `messages`. Vertex AI still expects a list of
instances, but each instance is then an OpenAI-compatible request, which is
clearer and better aligned with the SageMaker integration too. Kudos to him
for spotting that, and sorry from my end for any inconvenience @Narsil.
2024-10-15 18:11:59 +02:00
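A hedged illustration of the payload shape described in the commit above (field values are made up): a list of instances, each one an OpenAI-compatible chat request with the generation kwargs flattened to the same level as `messages`.

    payload = {
        "instances": [
            {
                "messages": [{"role": "user", "content": "Hello!"}],
                # Generation kwargs sit next to `messages`, not nested below.
                "max_tokens": 128,
                "temperature": 0.7,
            }
        ]
    }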
Daniël de Kok ce7e356561 Use flashinfer for Gemma 2. 2024-10-15 13:49:32 +00:00
Nicolas Patry cf04a43fb1
Fixing linters. (#2650) 2024-10-15 12:43:49 +02:00
Dmitry Rogozhkin 58848cb471
feat: enable pytorch xpu support for non-attention models (#2561)
The XPU backend is available natively (without IPEX) in PyTorch starting
from version 2.4. This commit extends TGI to cover the case where the user
has XPU support through PyTorch 2.4 but does not have IPEX installed.
Models which don't require attention can work; for models that require
attention, more work is needed to provide an attention implementation.

Tested with the following models:
* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-10-14 18:28:49 +02:00
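A small sketch of the dispatch idea, under the assumption that the check looks roughly like this (names are illustrative, not TGI's actual code): use PyTorch's native XPU backend when present, and record whether IPEX is importable.

    import importlib.util

    import torch

    HAS_IPEX = importlib.util.find_spec("intel_extension_for_pytorch") is not None

    def xpu_available() -> bool:
        # torch.xpu ships natively with PyTorch >= 2.4; guard older versions.
        return hasattr(torch, "xpu") and torch.xpu.is_available()

    if xpu_available():
        device = torch.device("xpu")
        # Without IPEX, only models that don't need attention kernels run.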
Wang, Yi 7a82ddcbd0
update ipex to fix incorrect output of mllama on CPU (#2640)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-14 16:32:33 +02:00