hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Morgan Funtowicz	e983ee5bb8	make sure the context is not dropped in the middle of the async decoding.	2024-07-17 21:56:50 +00:00
Morgan Funtowicz	9220340ff7	compute the number of maximum new tokens for each request independently	2024-07-17 13:55:29 +00:00
Morgan Funtowicz	a01cd030d4	oops missing c++ backend definitions	2024-07-16 20:11:59 +00:00
Morgan Funtowicz	7784a21d48	impl RwLock scenario for TensorRtLllmBackend	2024-07-16 20:08:10 +00:00
Morgan Funtowicz	31d9f4d5dc	expose shutdown function at ffi layer	2024-07-15 07:36:01 +00:00
Morgan Funtowicz	b291be64a0	impl the rust backend which currently cannot move the actual computation in background thread	2024-07-12 19:26:32 +00:00
Morgan Funtowicz	518d9a9e0b	make sure to track include/ffi.h to trigger rebuild from cargo	2024-07-12 19:26:04 +00:00
Morgan Funtowicz	344f33f398	end to end ffi flow working	2024-07-12 19:25:40 +00:00
Morgan Funtowicz	b846ae2d9e	use external fmt lib	2024-07-12 19:24:59 +00:00
Morgan Funtowicz	1972669f49	remove fmt import	2024-07-12 19:24:09 +00:00
Morgan Funtowicz	50e9fc89c8	working setup of the ffi layer	2024-07-11 21:24:32 +00:00
Morgan Funtowicz	5aede911f8	include guard to build example in cmakelists	2024-07-11 21:24:01 +00:00
Morgan Funtowicz	ed14bd6818	use correct include for spdlog	2024-07-10 13:57:31 +00:00
Morgan Funtowicz	42748d5960	allow converting huggingface::tokenizers error to TensorRtLlmBackendError	2024-07-10 13:56:57 +00:00
Morgan Funtowicz	40fe2ec0ff	add auth_token CLI argument to provide hf hub authentification token	2024-07-10 13:50:28 +00:00
Morgan Funtowicz	ca9da2dd49	create cmake install target to put everything relevant in installation folder	2024-07-10 13:48:59 +00:00
Morgan Funtowicz	4272b8cf51	correctly tell cmake to build dependent tensorrt-llm required libraries	2024-07-10 13:48:44 +00:00
Morgan Funtowicz	6c92ebe6a8	update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c	2024-07-10 13:47:56 +00:00
Morgan Funtowicz	7b9f92a0aa	use spdlog release 1.14.1 moving forward	2024-07-10 13:47:31 +00:00
Morgan Funtowicz	13eabfabcb	implement the Stream method to send new tokens through a callback	2024-07-09 13:46:48 +00:00
Morgan Funtowicz	09292b06a0	updated logic and comment to detect cuda compute capabilities	2024-07-09 12:15:41 +00:00
Morgan Funtowicz	bec188ff73	bind to CUDA::nvml to retrieve compute capabilities at runtime	2024-07-08 22:32:41 +00:00
Morgan Funtowicz	68a0247a2c	unconditionally call InitializeBackend on the FFI layer	2024-07-08 22:09:09 +00:00
Morgan Funtowicz	da926feaa1	make leader executor mode working	2024-07-08 22:08:49 +00:00
Morgan Funtowicz	f53ffa886d	Specify which default log level to use depending on CMake build type	2024-07-08 22:06:49 +00:00
Morgan Funtowicz	4113d6d51b	Move to latest TensorRT-LLM version	2024-07-08 22:06:30 +00:00
Morgan Funtowicz	29c7cb36e5	Remembering to check how we can detect support for chunked context	2024-07-03 21:38:17 +00:00
Morgan Funtowicz	f57f2a4521	First version loading engines and making it ready for inference	2024-07-03 21:12:24 +00:00
Morgan Funtowicz	f8a1463915	Enable end to end CMake build	2024-07-03 10:27:53 +02:00
Morgan Funtowicz	818162e0c2	Overall build TRTLLM and deps through CMake build system	2024-07-02 17:16:27 +02:00
Morgan Funtowicz	6dc98abe46	Remove unused parameters and force tokenizer name to be set	2024-07-01 16:11:59 +02:00
Morgan Funtowicz	47ac5c654d	Working FFI call for TGI and TRTLLM backend	2024-07-01 15:53:23 +02:00
Morgan Funtowicz	dc402dc9ac	Initial setup for CXX binding to TRTLLM	2024-06-30 23:37:20 +02:00
OlivierDehaene	230f2a415a	refacto	2024-06-26 14:12:01 +02:00
OlivierDehaene	93e0a7de8b	refacto	2024-06-26 14:00:03 +02:00
OlivierDehaene	b562680be4	wip	2024-06-26 13:13:32 +02:00
OlivierDehaene	504754861f	wip	2024-06-26 12:08:56 +02:00
drbh	be2d38032a	fix: simplify kserve endpoint and fix imports (#2119)	2024-06-25 19:30:10 -04:00
Daniël de Kok	f1f98e369f	Add support for Marlin 2:4 sparsity (#2102) This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when: * The quantizer is `marlin`; * the quantizer checkpoint format is `marlin_24`. Fixes #2098.	2024-06-25 21:09:42 +02:00
Daniël de Kok	14980df2df	Support AWQ quantization with bias (#2117) When the AWQ quantizer was used with a layer that uses a bias, the bias tensor was not correctly passed/used. Instead, the value `true`/`1.0` was added to the linear transformation. Correctly pass through the bias when it is not `None`. Fixes #2106.	2024-06-25 21:09:00 +02:00
drbh	04e1af94d7	Enable multiple LoRa adapters (#2010) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: perfer loraxs custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-06-25 14:46:27 -04:00
Nicolas Patry	a2a97b05d6	Fix CI . (#2118) Fix clippy.	2024-06-25 17:53:36 +02:00
Daniël de Kok	fc9c3153e5	Add pytest release marker (#2114) * Add pytest release marker Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`. * Mark many models as `release` to speed up CI	2024-06-25 16:53:20 +02:00
Wang, Yi	e563983d90	fix cpu and xpu issue (#2116) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-06-25 16:47:06 +02:00
Nicolas Patry	9e2fdf57c0	Removing IPEX_AVAIL. (#2115) * Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most code is exactly similar except for a very few spots. The biggest number of spots is the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them. * Forgot a few places. * Unrelated change. * Fixing HF_TOKEN. * HF_TOKEN	2024-06-25 13:20:57 +02:00
drbh	3f3b7ffd67	feat: add simple tests for weights (#2092) * feat: add simple tests for weights * fix: adjust types and add tests * fix: adjust so all tests pass * feat: improve weight tests * fix: add missing tests and renames * fix: tweak shapes	2024-06-25 12:22:59 +02:00
Wang, Yi	b64c70c9e7	Cpu tgi (#1936) * add CPU tgi support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * ipex distributed ops support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>	2024-06-25 12:21:29 +02:00
sunxichen	b69f078041	fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089) Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>	2024-06-25 10:59:50 +02:00
Wang, Yi	83634dc122	use xpu-smi to dump used memory (#2047) * use xpu-smi to dump used memory xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Update server/text_generation_server/utils/import_utils.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-06-25 10:15:46 +02:00
Jeff	5b2155b0f8	corrected Pydantic warning. (#2095) * corrected Pydantic warning. * Update clients/python/text_generation/types.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-06-25 10:10:32 +02:00

836 Commits
