hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Daniël de Kok	4562c16048	Use a block size of 1 for FlashInfer	2024-08-01 13:51:17 +00:00
Daniël de Kok	8fb8e1da78	Add FlashInfer support This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet. The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping: * A wrapper class needs to be constructed (which we just call state). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size. * Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call. When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call: * Set the context variable to the forward call's state. * Call `begin_forward` on the state. * Yield. * Call `end_forward` on the state. * Reset the context variable. We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.	2024-08-01 13:41:34 +00:00
Daniël de Kok	fe41e13b45	Unify attention output handling - Always return the hidden states. - Create the output tensor inside the `attention` and `paged_attention` functions. This removes the difference between how the output is handled between attention (output parameter) and paged attention (return value). This also removes the assumption that the attention implementation can write to an output tensor (in preparation of FlashInfer).	2024-08-01 13:41:34 +00:00
Wang, Yi	9ab9937414	enable HuggingFaceM4/idefics-9b in intel gpu (#2338 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-01 11:08:36 +02:00
drbh	f7f61876cf	Pr 2290 ci run (#2329 ) * MODEL_ID propagation fix * fix: remove global model id --------- Co-authored-by: root <root@tw031.pit.tensorwave.lan>	2024-07-31 10:27:15 -04:00
Daniël de Kok	34f7dcfd80	Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300 ) The `GPTWeightLoader` was structured like this in pseudocode: if marlin: Set up tensors in a way that GPTQ-Marlin expects else: Set up tensors in a way that ExLlama/GPTQ/AWQ expect However, the GPTQ-Marlin implementation details should really be in the `marlin` module. So move the former part out to a separate `GPTQMarlinWeightsLoader`.	2024-07-31 13:08:41 +02:00
Daniël de Kok	53aec27328	server quantize: store quantizer config in standard format (#2299 ) - Create `quantization_config` option in the model config. - Don't store the quantizer config in tensors anymore.	2024-07-30 15:16:20 +02:00
Erik Kaunismäki	3d7f4f41bb	patch-error-on-invalid-grammar (#2282 ) * quick fix * allow silent failure * explicit todo that this is only short term	2024-07-29 10:09:25 -04:00
Daniël de Kok	922732b255	Install Marlin from standalone package (#2320 )	2024-07-29 15:37:10 +02:00
drbh	bab02ff2bc	feat: add ruff and resolve issue (#2262 ) * feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check	2024-07-26 10:29:09 -04:00
Daniël de Kok	4b49c50f4c	Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313 )	2024-07-26 14:57:24 +02:00
Daniël de Kok	9256d7c38c	Some small fixes for the Torch 2.4.0 update (#2304 ) * Fix GPTQ autotune data type to be compatible with Torch 2.4.0 * Update poetry lock file * Fix small PaliGemma logprob differences after the torch update	2024-07-25 13:34:44 +02:00
drbh	5d85a958c9	fix: refactor adapter weight loading and mapping (#2193 ) * fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: improve adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo	2024-07-24 15:32:14 -04:00
Daniël de Kok	93d2b9fe9c	Split up `layers.marlin` into several files (#2292 ) The marlin.py file was getting large, split it up.	2024-07-24 16:33:26 +02:00
Wang, Yi	8642250602	fix of use of unquantized weights in cohere GQA loading, also enable … (#2291 ) fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:44:02 +02:00
Wang, Yi	5ad39dd3c3	fix crash in multi-modal (#2245 ) * fix crash in multi-modal Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update according to review comment Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix llava_next regression in latest main Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:39:08 +02:00
OlivierDehaene	a895029424	hotfix: update nccl	2024-07-23 23:31:28 +02:00
OlivierDehaene	e7e3aa6cac	chore: update to torch 2.4 (#2259 ) * chore: update to torch 2.4 * remove un-necessary patch * fix	2024-07-23 20:39:43 +00:00
Daniël de Kok	bc9593a5b1	hotfix: pin numpy (#2289 )	2024-07-23 17:53:19 +02:00
Daniël de Kok	4ab4173767	Add support for Llama 3 rotary embeddings (#2286 ) * Add support for Llama 3 rotary embeddings * Update transformers to 4.43	2024-07-23 17:18:54 +02:00
shaltielshmid	3961e32390	[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254 ) * Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-23 15:00:07 +02:00
Daniël de Kok	9935720c87	Add support for repacking AWQ weights for GPTQ-Marlin (#2278 ) * Add support for repacking AWQ weights for GPTQ-Marlin So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now enabling AWQ using Marlin requires running TGI with `--quantize gptq`. * Enable Marlin for supported AWQ configurations by default This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.	2024-07-23 13:08:20 +02:00
OlivierDehaene	5fca30ee15	fix(l4): fix fp8 logic on l4 (#2277 ) * fix(l4): fix fp8 logic on l4 * also quant weights with single scale * use marlin even on 89	2024-07-23 11:24:29 +02:00
Nicolas Patry	abc32537ea	Fixing mistral nemo. (#2276 )	2024-07-23 11:16:03 +02:00
Nicolas Patry	6aeb669072	Softcapping for gemma2. (#2273 ) * Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.	2024-07-22 18:27:10 +02:00
OlivierDehaene	4844ff790a	fix(server): fix fp8 weight loading (#2268 ) * fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype	2024-07-22 15:51:32 +00:00
icyboy™	4e4207224e	Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug * Hotfix: fix of use of unquantized weights in Mixtral GQA loading	2024-07-22 11:31:00 +02:00
OlivierDehaene	f3435bab8c	fix(server): fix deepseekv2 loading (#2266 )	2024-07-21 18:48:04 +02:00
OlivierDehaene	53ec0b790b	feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248 ) * feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout	2024-07-20 19:02:04 +02:00
Daniël de Kok	e52be9bba2	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loading that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads of size 192 need an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-07-19 17:23:20 +02:00
Daniël de Kok	3f37a66774	Hotfix: pass through model revision in `VlmCausalLM` (#2258 )	2024-07-19 15:59:00 +02:00
Daniël de Kok	3b41e93a09	Hotfix: fix MPT after recent refactor (#2257 )	2024-07-19 14:42:35 +02:00
Daniël de Kok	18db78f295	Hotfix: various GPT-based model fixes (#2256 )	2024-07-19 14:42:19 +02:00
Daniël de Kok	80adb5be16	Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255 )	2024-07-19 12:55:59 +02:00
Daniël de Kok	ba291dad9f	Improve the handling of quantized weights (#2250 ) * Improve the handling of quantized weights Handling of quantized weights was split between two mechanisms: - For quantized checkpoints, we used the new weight loader infrastructure. - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on conditionals in `get_linear`. Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model, but unquantized vision and connector models. However, the context manager would be overridden by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits: - We can use context managers with all quantizers. - All the implementation details move down to the quantizer layers, `get_linear` does not need to know how to handle quantizer linear layers. - All quantizer weights are strongly typed, we don't pass around raw tensors. - We don't have to pass around the `quantizer` string everywhere. * Exclude non-MLP layers when using FP8 quantization with Llama	2024-07-19 09:37:39 +02:00
OlivierDehaene	1d1b1efa01	fix(server): fix cohere (#2249 )	2024-07-18 16:00:13 +02:00
Daniël de Kok	da82c63a4f	Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237 ) Fixes #2236.	2024-07-16 09:30:57 +02:00
Daniël de Kok	2cb1842852	`server quantize`: expose groupsize option (#2225 )	2024-07-16 08:36:05 +02:00
Daniël de Kok	06d0e880e0	Add support for AWQ-quantized Idefics2 (#2233 ) Fixes #2036.	2024-07-16 07:58:25 +02:00
Hugo Larcher	0ad7f6f87d	fix: Remove bitsandbytes installation when running cpu-only install (#2216 ) Remove bitsandbytes installation when running cpu-only install	2024-07-15 15:34:20 +02:00
drbh	5a65066922	feat: simple mistral lora integration tests (#2180 ) * feat: simple mistral lora integration tests * fix: include args in docker launcher * fix: disable cuda graphs with lora and warn * fix: adjust docs and precommit issues * fix: re update docs	2024-07-15 09:16:15 -04:00
Daniël de Kok	dbb23fbfa8	Use symmetric quantization in the `quantize` subcommand (#2120 ) Packing of asymmetric quantization is broken, all (q)zeros values of `0` get reset to `1`, resulting in a loss of accuracy. So instead use symmetric quantization. To be able to distinguish models with symmetric and asymmetric quantization, a new config tensor `gptq_sym` is added. If this tensor is not present, we assume `sym=False`.	2024-07-12 12:20:12 +02:00
SeongBeomLEE	c46eaf707b	[fix] Modifying base in yarn embedding (#2212 )	2024-07-12 10:04:51 +02:00
Daniël de Kok	cb150eb295	Add support for FP8 on compute capability >=8.0, <8.9 (#2213 ) Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs with compute capability >=8.0 and <8.9. Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>	2024-07-11 16:03:26 +02:00
Daniël de Kok	8511669cb2	Move quantized weight handling out of the `Weights` class (#2194 ) Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.	2024-07-09 20:04:03 +02:00
fxmarty	4c50b6d04b	Fix nccl regression on PyTorch 2.3 upgrade (#2099 ) * fix nccl issue * add note in dockerfile * use v2.22.3 that also fixes @samsamoa's repro * poetry actually can't handle the conflict between torch and nccl * set LD_PRELOAD	2024-07-08 17:52:10 +02:00
Daniël de Kok	5c7c9f1390	Falcon/DBRX: get correct number of key-value heads (#2205 )	2024-07-08 13:22:38 +02:00
Daniël de Kok	153fcf7739	Fix incorrect cache allocation with multi-query (#2203 ) We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.	2024-07-08 11:19:48 +02:00
Daniël de Kok	cce475a949	hotfix: Fix number of KV heads (#2202 ) Fix number of KV heads	2024-07-08 09:52:12 +02:00
icyboy™	521d0d990f	fix dbrx & opt model prefix bug (#2201 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug	2024-07-08 09:01:14 +02:00

530 Commits
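
The delegation described in commit 8511669cb2, where `Weights` keeps the low-level load operations and hands quantizer-specific processing to a loader, might look roughly like the sketch below. It is a simplified assumption, not the project's code: the method names, `DefaultLoader`, `ToyQuantizedLoader`, and the dict-backed tensor store are all hypothetical, and plain lists stand in for tensors.

```python
from abc import ABC, abstractmethod


class WeightsLoader(ABC):
    """Hypothetical minimal loader interface."""

    @abstractmethod
    def get_weights(self, weights, prefix):
        ...


class DefaultLoader(WeightsLoader):
    """Loads an unquantized weight as-is."""

    def get_weights(self, weights, prefix):
        return weights.get_tensor(f"{prefix}.weight")


class ToyQuantizedLoader(WeightsLoader):
    """Toy quantizer-specific loader: needs packed weights plus scales."""

    def get_weights(self, weights, prefix):
        return (
            weights.get_tensor(f"{prefix}.qweight"),
            weights.get_tensor(f"{prefix}.scales"),
        )


class Weights:
    """Provides low-level loads; delegates quantizer-specific work to a loader."""

    def __init__(self, tensors, loader):
        self._tensors = tensors  # dict-backed store, a stand-in for checkpoint files
        self.loader = loader

    def get_tensor(self, name):
        return self._tensors[name]

    def get_weights(self, prefix):
        # The loader calls back into the low-level operations of this class.
        return self.loader.get_weights(self, prefix)


store = {
    "fc.weight": [1.0, 2.0],
    "fc.qweight": [17, 34],
    "fc.scales": [0.5],
}
plain = Weights(store, DefaultLoader()).get_weights("fc")
quantized = Weights(store, ToyQuantizedLoader()).get_weights("fc")
```

Composition keeps `Weights` free of per-quantizer conditionals, which matches the commit's stated reason for not using a `GPTQWeights`-style subclass hierarchy.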