hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
fxmarty	8c590be463	Merge branch 'main' into ci_amd3	2024-07-08 13:06:39 +02:00
Daniël de Kok	153fcf7739	Fix incorrect cache allocation with multi-query (#2203 ) We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.	2024-07-08 11:19:48 +02:00
Daniël de Kok	cce475a949	hotfix: Fix number of KV heads (#2202 ) Fix number of KV heads	2024-07-08 09:52:12 +02:00
icyboy™	521d0d990f	fix dbrx & opt model prefix bug (#2201 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug	2024-07-08 09:01:14 +02:00
Daniël de Kok	05c094fcfa	Consistently take `prefix` in model constructors (#2191 ) * Consistently take `prefix` in model constructors * Release test check fix * Misc refactor-related fixes	2024-07-05 16:07:48 +02:00
Daniël de Kok	b67d46336e	Fix Starcoder2 after refactor (#2189 )	2024-07-05 12:22:45 +02:00
Nicolas Patry	853d4eb9cf	Hotfixing after refactor.	2024-07-05 09:25:29 +00:00
Nicolas Patry	fb2f74e2b9	Refactor dead code - Removing all `flash_xxx.py` files. (#2166 ) * Refactor dead code. * First working step. * Remove a lot of duplicated code. * More dead code. * More cleanup. * Fix Santacoder test. * Fixing the simple tests. * Fixing sharding. * Fixes for VLM. * Fixing santacoder (num_kv_heads hardcoded). * Removing more dead code. * Fixing `config.n_head`. * Stopping earlier because of `<end_of_utterance>` in idefics2. * Addresses comments. * Removing the dead code. * Fuse back mistral into FlashCausalLM. * Finish removal. * Fixing docs + causal_lm `batch_class`. * Fixing docs + causal.lm. * Add default to Gemma Causality. * Default value for gemma/gemma2. * Wrong default.	2024-07-05 10:29:56 +02:00
Aaron Mihalik	c6bcadf883	Adding "longrope" for Phi-3 (#2172 ) (#2179 ) Adding "longrope" for phi-3	2024-07-05 09:46:41 +02:00
fxmarty	29a416078c	Merge branch 'main' into ci_amd3	2024-07-02 15:32:53 +02:00
Felix Marty	add4d42cb3	do not use tunableop for non flash-causal-lm models	2024-07-02 12:52:55 +00:00
Nicolas Patry	0759ec495e	Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167 )	2024-07-02 14:26:47 +02:00
Nicolas Patry	dea9c0dc74	Fixing rocm. (#2164 )	2024-07-02 12:01:08 +02:00
drbh	b966bc0d35	fix: use the base layers weight in mistral rocm (#2155 )	2024-07-02 11:56:25 +02:00
Wang, Yi	5d97e0c4a3	fix FlashDecoding change's regression in intel platform (#2161 ) install triton because GPTQParams needs it. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-02 11:56:07 +02:00
Nicolas Patry	022f6515a4	Fixing graph capture for flash decoding. (#2163 )	2024-07-02 11:43:07 +02:00
Nicolas Patry	4327210e6b	[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940 ) * Using flash decoding Conditional flashdecoding. Fix max_q. Working kvcache Working version with flash decoding. Make it work for mistral. Fix after rebase. Less intrusive. Revert changes in modeling. Speedup flashdecoding. HHachweew Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs Not sending preallocated output. * Making it work on non flash decoding. * Fix Cohere. * Fix non decoding paths. * Rebased. * No need for cache_manager anymore. * Update? * "ipex" -> "cpu" * These do not belong. * Factoring cu_seqlen_qk for better abstracting over every model. * Fixing non flash tests/imports. * Changing return everywhere. * Update mistral past. * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though). * Fixup mistral clamping (had issues with cuda graphs). * No need to recreate anything actually.	2024-07-01 23:28:00 +02:00
Nicolas Patry	4f55f15840	Fixing baichuan override. (#2158 )	2024-07-01 23:25:54 +02:00
Wang, Yi	5da4cfab1c	refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132 ) * refine get xpu free memory Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable qwen2 in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable gemma/gemma2/phi in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-01 14:32:54 +02:00
Felix Marty	e0bfe4e7f0	fix	2024-07-01 12:31:56 +00:00
icyboy™	9d0ca503a8	fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123 ) https://github.com/huggingface/text-generation-inference/issues/2122	2024-07-01 14:17:22 +02:00
fxmarty	59849777de	Merge branch 'main' into ci_amd3	2024-07-01 14:14:46 +02:00
Felix Marty	9fd395fae4	fix tests	2024-07-01 12:12:26 +00:00
Daniël de Kok	2ce8019480	Use GPTQ-Marlin for supported GPTQ configurations (#2111 ) GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So let's use it by default if the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`. This subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model to be symmetric would select GPTQ-Marlin (which does not support asymmetric quantization).	2024-07-01 12:59:12 +02:00
drbh	25f57e2e98	fix: use weights from base_layer (#2141 )	2024-07-01 12:58:40 +02:00
Felix Marty	3d50ff71b7	bump torch to more recent version	2024-06-28 13:10:43 +00:00
Nicolas Patry	3ea8259af1	Fixing gemma2. (#2135 ) * Fixing gemma2. * Adding new model.	2024-06-27 16:04:20 +02:00
Daniël de Kok	dd2d91b043	Idefics2: sync added image tokens with transformers (#2080 ) Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.	2024-06-27 15:54:35 +02:00
fxmarty	227f78f3fe	Merge branch 'main' into ci_amd3	2024-06-26 12:08:42 +02:00
Daniël de Kok	f1f98e369f	Add support for Marlin 2:4 sparsity (#2102 ) This change adds support for 2:4 sparsity when using Marlin quantization. The 2:4 kernel is used when: * The quantizer is `marlin`; * the quantizer checkpoint format is `marlin_24`. Fixes #2098.	2024-06-25 21:09:42 +02:00
Daniël de Kok	14980df2df	Support AWQ quantization with bias (#2117 ) When the AWQ quantizer was used with a layer that uses a bias, the bias tensor was not correctly passed/used. Instead, the value `true`/`1.0` was added to the linear transformation. Correctly pass through the bias when it is not `None`. Fixes #2106.	2024-06-25 21:09:00 +02:00
drbh	04e1af94d7	Enable multiple LoRA adapters (#2010 ) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: prefer lorax's custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-06-25 14:46:27 -04:00
Wang, Yi	e563983d90	fix cpu and xpu issue (#2116 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-06-25 16:47:06 +02:00
Nicolas Patry	9e2fdf57c0	Removing IPEX_AVAIL. (#2115 ) * Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most code is identical except for a few spots. Most of those spots are in the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them. * Forgot a few places. * Unrelated change. * Fixing HF_TOKEN. * HF_TOKEN	2024-06-25 13:20:57 +02:00
drbh	3f3b7ffd67	feat: add simple tests for weights (#2092 ) * feat: add simple tests for weights * fix: adjust types and add tests * fix: adjust so all tests pass * feat: improve weight tests * fix: add missing tests and renames * fix: tweak shapes	2024-06-25 12:22:59 +02:00
Wang, Yi	b64c70c9e7	Cpu tgi (#1936 ) * add CPU tgi support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * ipex distributed ops support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>	2024-06-25 12:21:29 +02:00
fxmarty	dc53846456	Merge branch 'main' into ci_amd3	2024-06-25 11:20:00 +02:00
Wang, Yi	83634dc122	use xpu-smi to dump used memory (#2047 ) * use xpu-smi to dump used memory. XPU uses "ZE_AFFINITY_MASK" to select cards; usage is like CUDA_VISIBLE_DEVICES Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Update server/text_generation_server/utils/import_utils.py Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-06-25 10:15:46 +02:00
KevinDuffy94	1869ee2f57	Add OTLP Service Name Environment Variable (#2076 ) * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069 * Update Docs * Update README.md * Update Launcher Docs * Update Launcher Docs Removing Option	2024-06-25 09:33:01 +02:00
Felix Marty	5b6b257756	fix gpt2 tests - some weights were not contiguous	2024-06-24 18:51:08 +02:00
fxmarty	1846c1c210	fix tests	2024-06-24 18:50:18 +02:00
fxmarty	5a4b798f98	fix gptq tests, LLMM1 matrix bound	2024-06-24 18:49:45 +02:00
fxmarty	49db30a137	disable marlin tests on rocm/xpu	2024-06-24 18:49:37 +02:00
drbh	811a9381b1	feat: sort cuda graphs in descending order (#2104 )	2024-06-21 14:28:26 -04:00
Daniël de Kok	197c47a302	Fix `text-generation-server quantize` (#2103 ) The subcommand did not work due to some broken imports.	2024-06-21 15:28:51 +02:00
Daniël de Kok	bcb3faa1c2	Factor out sharding of packed tensors (#2059 ) For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.	2024-06-20 09:56:04 +02:00
Daniël de Kok	f5a9837592	Support exl2-quantized Qwen2 models (#2085 ) Fixes #2081.	2024-06-20 07:56:16 +02:00
Daniël de Kok	c8c7ccd31e	Set maximum grpc message receive size to 2GiB (#2075 ) * Set maximum grpc message receive size to 2GiB The previous default was 4MiB, which doesn't really work well for multi-modal models. * Update to Rust 1.79.0 * Fixup formatting to make PR pass	2024-06-17 16:40:44 +02:00
Daniël de Kok	e903770897	Support different image sizes in prefill in VLMs (#2065 ) When a batch contained images of different sizes during prefill, the server would fail (see e.g. #2056). Images were processed separately and then concatenated. However, this can fail for images with different sizes. Fix this by preprocessing all images in the batch together, so that the image processor can ensure that all image tensors have compatible sizes.	2024-06-17 10:49:41 +02:00
Tiezhen WANG	96b7b40ca3	Update the link for qwen2 (#2068 ) * Update the link for qwen2 * Fix Qwen2 model URL in model table * Fix too eager staging --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-06-14 11:59:33 +02:00

496 Commits