hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Morgan Funtowicz	ca9da2dd49	create cmake install target to put everything relevant in installation folder	2024-07-10 13:48:59 +00:00
Morgan Funtowicz	4272b8cf51	correctly tell cmake to build dependent tensorrt-llm required libraries	2024-07-10 13:48:44 +00:00
Morgan Funtowicz	6c92ebe6a8	update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c	2024-07-10 13:47:56 +00:00
Morgan Funtowicz	7b9f92a0aa	use spdlog release 1.14.1 moving forward	2024-07-10 13:47:31 +00:00
Morgan Funtowicz	13eabfabcb	implement the Stream method to send new tokens through a callback	2024-07-09 13:46:48 +00:00
Morgan Funtowicz	09292b06a0	updated logic and comment to detect cuda compute capabilities	2024-07-09 12:15:41 +00:00
Morgan Funtowicz	bec188ff73	bind to CUDA::nvml to retrieve compute capabilities at runtime	2024-07-08 22:32:41 +00:00
Morgan Funtowicz	68a0247a2c	unconditionally call InitializeBackend on the FFI layer	2024-07-08 22:09:09 +00:00
Morgan Funtowicz	da926feaa1	make leader executor mode working	2024-07-08 22:08:49 +00:00
Morgan Funtowicz	f53ffa886d	Specify which default log level to use depending on CMake build type	2024-07-08 22:06:49 +00:00
Morgan Funtowicz	4113d6d51b	Move to latest TensorRT-LLM version	2024-07-08 22:06:30 +00:00
Morgan Funtowicz	29c7cb36e5	Remembering to check how we can detect support for chunked context	2024-07-03 21:38:17 +00:00
Morgan Funtowicz	f57f2a4521	First version loading engines and making it ready for inference	2024-07-03 21:12:24 +00:00
Morgan Funtowicz	f8a1463915	Enable end to end CMake build	2024-07-03 10:27:53 +02:00
Morgan Funtowicz	818162e0c2	Overall build TRTLLM and deps through CMake build system	2024-07-02 17:16:27 +02:00
Morgan Funtowicz	6dc98abe46	Remove unused parameters and force tokenizer name to be set	2024-07-01 16:11:59 +02:00
Morgan Funtowicz	47ac5c654d	Working FFI call for TGI and TRTLLM backend	2024-07-01 15:53:23 +02:00
Morgan Funtowicz	dc402dc9ac	Initial setup for CXX binding to TRTLLM	2024-06-30 23:37:20 +02:00
OlivierDehaene	230f2a415a	refacto	2024-06-26 14:12:01 +02:00
OlivierDehaene	93e0a7de8b	refacto	2024-06-26 14:00:03 +02:00
OlivierDehaene	b562680be4	wip	2024-06-26 13:13:32 +02:00
OlivierDehaene	504754861f	wip	2024-06-26 12:08:56 +02:00
drbh	be2d38032a	fix: simplify kserve endpoint and fix imports (#2119)	2024-06-25 19:30:10 -04:00
Daniël de Kok	f1f98e369f	Add support for Marlin 2:4 sparsity (#2102) This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when: * The quantizer is `marlin`; * the quantizer checkpoint format is `marlin_24`. Fixes #2098.	2024-06-25 21:09:42 +02:00
Daniël de Kok	14980df2df	Support AWQ quantization with bias (#2117) When the AWQ quantizer was used with a layer that uses a bias, the bias tensor was not correctly passed/used. Instead, the value `true`/`1.0` was added to the linear transformation. Correctly pass through the bias when it is not `None`. Fixes #2106.	2024-06-25 21:09:00 +02:00
drbh	04e1af94d7	Enable multiple LoRa adapters (#2010) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: prefer loraxs custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-06-25 14:46:27 -04:00
Nicolas Patry	a2a97b05d6	Fix CI. (#2118) Fix clippy.	2024-06-25 17:53:36 +02:00
Daniël de Kok	fc9c3153e5	Add pytest release marker (#2114) * Add pytest release marker Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`. * Mark many models as `release` to speed up CI	2024-06-25 16:53:20 +02:00
Wang, Yi	e563983d90	fix cpu and xpu issue (#2116) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-06-25 16:47:06 +02:00
Nicolas Patry	9e2fdf57c0	Removing IPEX_AVAIL. (#2115) * Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most code is exactly similar except for a very few spots. The biggest number of spots is the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them. * Forgot a few places. * Unrelated change. * Fixing HF_TOKEN. * HF_TOKEN	2024-06-25 13:20:57 +02:00
drbh	3f3b7ffd67	feat: add simple tests for weights (#2092) * feat: add simple tests for weights * fix: adjust types and add tests * fix: adjust so all tests pass * feat: improve weight tests * fix: add missing tests and renames * fix: tweak shapes	2024-06-25 12:22:59 +02:00
Wang, Yi	b64c70c9e7	Cpu tgi (#1936) * add CPU tgi support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * ipex distributed ops support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>	2024-06-25 12:21:29 +02:00
sunxichen	b69f078041	fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089) Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>	2024-06-25 10:59:50 +02:00
Wang, Yi	83634dc122	use xpu-smi to dump used memory (#2047) * use xpu-smi to dump used memory xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Update server/text_generation_server/utils/import_utils.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-06-25 10:15:46 +02:00
Jeff	5b2155b0f8	corrected Pydantic warning. (#2095) * corrected Pydantic warning. * Update clients/python/text_generation/types.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-06-25 10:10:32 +02:00
KevinDuffy94	1869ee2f57	Add OTLP Service Name Environment Variable (#2076) * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069 * Update Docs * Update README.md * Update Launcher Docs * Update Launcher Docs Removing Option	2024-06-25 09:33:01 +02:00
Lucain	3447c722fd	Support `HF_TOKEN` environment variable (#2066) * Support HF_TOKEN environment variable * Load test. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-06-25 09:23:12 +02:00
ur4t	405765b18c	Fix cargo-chef prepare (#2101) * Fix cargo-chef prepare In prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly. If Cargo.lock is not present, cargo-chef will generate a new one first, which might vary a lot and invalidate docker build caches. * Fix Dockerfile_amd and Dockerfile_intel	2024-06-24 18:16:36 +02:00
Nicolas Patry	480d3b3304	New runner. Manual squash. (#2110) * New runner. Manual squash. * Network host. * Put back trufflehog with proper extension. * No network host ? * Moving buildx install after tailscale ? * 1.79	2024-06-24 18:08:34 +02:00
drbh	811a9381b1	feat: sort cuda graphs in descending order (#2104)	2024-06-21 14:28:26 -04:00
Daniël de Kok	197c47a302	Fix `text-generation-server quantize` (#2103) The subcommand did not work due to some broken imports.	2024-06-21 15:28:51 +02:00
Daniël de Kok	bcb3faa1c2	Factor out sharding of packed tensors (#2059) For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.	2024-06-20 09:56:04 +02:00
Daniël de Kok	f5a9837592	Support exl2-quantized Qwen2 models (#2085) Fixes #2081.	2024-06-20 07:56:16 +02:00
drbh	cdbf802860	feat: rotate tests ci token (#2091)	2024-06-19 17:02:58 -04:00
Daniël de Kok	11ea9ce002	CI: pass pre-commit hooks again (#2084)	2024-06-18 09:38:21 +02:00
Guillaume LEGENDRE	4f25c67d63	CI: Tailscale improvements (#2079) * test local tailscale * Update build.yaml * Update build.yaml * Update build.yaml * Update build.yaml * wait for ssh * network host * change step order	2024-06-18 09:13:04 +02:00
Daniël de Kok	c8c7ccd31e	Set maximum grpc message receive size to 2GiB (#2075) * Set maximum grpc message receive size to 2GiB The previous default was 4MiB, which doesn't really work well for multi-modal models. * Update to Rust 1.79.0 * Fixup formatting to make PR pass	2024-06-17 16:40:44 +02:00
Ziru Niu	0f7d38e774	fix build.rs watch files (#2072)	2024-06-17 12:10:01 +02:00
Lysandre Debut	131838919e	Contributing guide & Code of Conduct (#2074) * Contributing guide & Code of Conduct * Redirect to GitHub's tutorial on PRs	2024-06-17 12:09:31 +02:00
Daniël de Kok	e903770897	Support different image sizes in prefill in VLMs (#2065) When a batch contained images of different sizes during prefill, the server would fail (see e.g. #2056). Images were processed separately and then concatenated. However, this can fail for images with different sizes. Fix this by preprocessing all images in the batch together, so that the image processor can ensure that all image tensors have compatible sizes.	2024-06-17 10:49:41 +02:00

821 Commits

All Branches