hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Nicolas Patry	57b3495823	Fixing exl2 and other quantize tests again. (#2419 ) * Fixing exl2 and other quantize tests again. * Mark exl2 as non release (so CI tests them, needs to be removed later). * Fixing exl2 (by disabling cuda graphs) * Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it). * Removing serde override. * Go back to released exl2 and remove log. * Adding warnings for deprecated bitsandbytes + upgrade info to warn.	2024-08-15 11:12:51 +02:00
Nicolas Patry	f3b5c69441	Upgrading exl2. (#2415 ) * Upgrading exl2. * Fixing the other pathways. * Fix idefics.	2024-08-14 11:58:08 +02:00
drbh	1cebccc72b	fix: adds causal to attention params (#2408 ) fix: adds causal to attention params to check when using flash attn v1	2024-08-13 16:19:46 +02:00
Wang, Yi	59922f9bc1	add numa to improve cpu inference perf (#2330 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-13 15:33:55 +02:00
drbh	8a7749b8fb	fix: include create_exllama_buffers and set_device for exllama (#2407 )	2024-08-12 17:59:37 -04:00
drbh	4c3f8a70a1	fix: allocate tmp based on sgmv kernel if available (#2345 ) * fix: allocate tmp based on sgmv kernel if available * fix: re add copy build artifacts step for punica kernels	2024-08-12 17:24:32 +02:00
drbh	155f9c98e2	feat: validate template variables before apply and improve sliding wi… (#2403 ) * feat: validate template variables before apply and improve sliding window check * fix: improve missing template var test	2024-08-12 10:58:40 -04:00
Daniël de Kok	8deeaca4ff	Add support for prefix caching to the v3 router (#2392 ) This change adds support for prefix caching to the v3 router. This is broken up from the backend support to ease reviewing. For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen prior. If a new prefill is a prefix of a previously-seen prefill, the router will send a request with `prefix_len>0`, which can be used by the backend to decide to reuse KV blocks from the cache, rather than recomputing them. Even though backend support is not added in this PR, the backend will still work with prefix caching enabled. The prefix lengths are just ignored and not used.	2024-08-12 14:59:17 +02:00
Nicolas Patry	84bc3d7b7d	Fixing import exl2 (#2399 )	2024-08-12 14:08:59 +02:00
Nicolas Patry	7a48a84784	Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backends (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
Vaibhav Srivastav	b2b9c42724	Update documentation for Supported models (#2386 ) * Minor doc fixes * up. * Other minor updates.	2024-08-09 15:01:34 +02:00
Daniël de Kok	7830de1566	Add FlashInfer support (#2354 ) This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet. The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping: * A wrapper class needs to be constructed (which we just call state). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size. * Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call. When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call: * Set the context variable to the forward call's state. * Call `begin_forward` on the state. * Yield. * Call `end_forward` on the state. * Reset the context variable. We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.	2024-08-09 11:42:00 +02:00
drbh	f852190060	fix: prefer hidden_activation over hidden_act in gemma2 (#2381 )	2024-08-08 14:08:56 -04:00
drbh	2ca5980634	Pr 2337 ci branch (#2379 ) * hotfix: fix xpu crash brought by code refine. torch.xpu rely on import ipex Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * re-enable gemma2 in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix in regression in ipex flashattention Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-08 12:30:29 -04:00
Wang, Yi	689b1abbf6	fix EleutherAI/gpt-neox-20b does not work in tgi (#2346 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-08 12:08:52 -04:00
drbh	82d19d7723	Pr 2374 ci branch (#2378 ) * Update __init__.py Fix issue with NoneType comparison for max_input_tokens and sliding_window - Add default values for max_input_tokens and sliding_window to handle None cases. - Ensure the comparison between max_input_tokens and sliding_window is handled correctly to prevent TypeError. - This change addresses the error: TypeError: '<=' not supported between instances of 'int' and 'NoneType'. * Update __init__.py Handle NoneType in sliding_window comparison to fix TypeError in __init__.py by ensuring the comparison logic accounts for NoneType values, preventing errors and improving code robustness. * fix: syntax/style tweak --------- Co-authored-by: Praz <prazanth2006@gmail.com>	2024-08-08 11:14:06 -04:00
drbh	a379d5536b	Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371 ) * Fix the bug * fix: run lints * fix: small syntax tweak --------- Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>	2024-08-07 23:14:02 -04:00
drbh	21267f3ca3	add gptj modeling in TGI #2366 (CI RUN) (#2372 ) * add gptj modeling Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-07 21:32:37 -04:00
almersawi	8094ecfc9e	fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350 ) Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>	2024-08-07 19:45:23 -04:00
drbh	133015f408	fix: prefer original layernorm names for 180B (#2365 )	2024-08-06 15:25:30 -04:00
drbh	a64d407d64	fix: default num_ln_in_parallel_attn to one if not supplied (#2364 )	2024-08-06 13:33:22 -04:00
drbh	29b8d19cdf	fix: return the out tensor rather than the function's return value (#2361 )	2024-08-06 13:49:53 +02:00
drbh	215ed3ad52	fix: attempt forward on flash attn2 to check hardware support (#2335 ) * fix: attempt forward on flash attn2 to check hardware support * fix: warn window_size_left when using flash attn 1 * fix: prefer version check over test op and avoid window_size_left if not flash attn2 * fix: improve conditional and error message * fix: update sliding window conditional * fix: simplify changes and revert model changes * fix: avoid changing conditional * fix: typo tweak	2024-08-05 09:11:40 -04:00
Daniël de Kok	47447ef017	Unify attention output handling (#2343 ) - Always return the hidden states. - Create the output tensor inside the `attention` and `paged_attention` functions. This removes the difference between how the output is handled between attention (output parameter) and paged attention (return value). This also removes the assumption that the attention implementation can write to an output tensor (in preparation of FlashInfer).	2024-08-01 17:03:28 +02:00
Wang, Yi	9ab9937414	enable HuggingFaceM4/idefics-9b in intel gpu (#2338 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-01 11:08:36 +02:00
drbh	f7f61876cf	Pr 2290 ci run (#2329 ) * MODEL_ID propagation fix * fix: remove global model id --------- Co-authored-by: root <root@tw031.pit.tensorwave.lan>	2024-07-31 10:27:15 -04:00
Daniël de Kok	34f7dcfd80	Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300 ) The `GPTWeightLoader` was structured like this in pseudocode: if marlin: Set up tensors in a way that GPTQ-Marlin expects else: Set up tensors in a way that ExLlama/GPTQ/AWQ expect However, the GPTQ-Marlin implementation details should really be in the `marlin` module. So move the former part out to a separate `GPTQMarlinWeightsLoader`.	2024-07-31 13:08:41 +02:00
Daniël de Kok	53aec27328	server quantize: store quantizer config in standard format (#2299 ) - Create `quantization_config` option in the model config. - Don't store the quantizer config in tensors anymore.	2024-07-30 15:16:20 +02:00
Erik Kaunismäki	3d7f4f41bb	patch-error-on-invalid-grammar (#2282 ) * quick fix * allow silent failure * explicit todo that this is only short term	2024-07-29 10:09:25 -04:00
Daniël de Kok	922732b255	Install Marlin from standalone package (#2320 )	2024-07-29 15:37:10 +02:00
drbh	bab02ff2bc	feat: add ruff and resolve issue (#2262 ) * feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check	2024-07-26 10:29:09 -04:00
Daniël de Kok	4b49c50f4c	Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313 )	2024-07-26 14:57:24 +02:00
Daniël de Kok	9256d7c38c	Some small fixes for the Torch 2.4.0 update (#2304 ) * Fix GPTQ autotune data type to be compatible with Torch 2.4.0 * Update poetry lock file * Fix small PaliGemma logprob differences after the torch update	2024-07-25 13:34:44 +02:00
drbh	5d85a958c9	fix: refactor adapter weight loading and mapping (#2193 ) * fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: improve adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo	2024-07-24 15:32:14 -04:00
Daniël de Kok	93d2b9fe9c	Split up `layers.marlin` into several files (#2292 ) The marlin.py file was getting large, split it up.	2024-07-24 16:33:26 +02:00
Wang, Yi	8642250602	fix of use of unquantized weights in cohere GQA loading, also enable … (#2291 ) fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:44:02 +02:00
Wang, Yi	5ad39dd3c3	fix crash in multi-modal (#2245 ) * fix crash in multi-modal Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update according to review comment Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix llava_next regression in latest main Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:39:08 +02:00
Daniël de Kok	4ab4173767	Add support for Llama 3 rotary embeddings (#2286 ) * Add support for Llama 3 rotary embeddings * Update transformers to 4.43	2024-07-23 17:18:54 +02:00
shaltielshmid	3961e32390	[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254 ) * Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-23 15:00:07 +02:00
Daniël de Kok	9935720c87	Add support for repacking AWQ weights for GPTQ-Marlin (#2278 ) * Add support for repacking AWQ weights for GPTQ-Marlin So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now enabling AWQ using Marlin requires running TGI with `--quantize gptq`. * Enable Marlin for supported AWQ configurations by default This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.	2024-07-23 13:08:20 +02:00
OlivierDehaene	5fca30ee15	fix(l4): fix fp8 logic on l4 (#2277 ) * fix(l4): fix fp8 logic on l4 * also quant weights with single scale * use marlin even on 89	2024-07-23 11:24:29 +02:00
Nicolas Patry	abc32537ea	Fixing mistral nemo. (#2276 )	2024-07-23 11:16:03 +02:00
Nicolas Patry	6aeb669072	Softcapping for gemma2. (#2273 ) * Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.	2024-07-22 18:27:10 +02:00
OlivierDehaene	4844ff790a	fix(server): fix fp8 weight loading (#2268 ) * fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype	2024-07-22 15:51:32 +00:00
icyboy™	4e4207224e	Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug * Hotfix: fix of use of unquantized weights in Mixtral GQA loading	2024-07-22 11:31:00 +02:00
OlivierDehaene	f3435bab8c	fix(server): fix deepseekv2 loading (#2266 )	2024-07-21 18:48:04 +02:00
OlivierDehaene	53ec0b790b	feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248 ) * feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout	2024-07-20 19:02:04 +02:00
Daniël de Kok	e52be9bba2	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loading that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads of size 192 need an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-07-19 17:23:20 +02:00
Daniël de Kok	3f37a66774	Hotfix: pass through model revision in `VlmCausalLM` (#2258 )	2024-07-19 15:59:00 +02:00
Daniël de Kok	3b41e93a09	Hotfix: fix MPT after recent refactor (#2257 )	2024-07-19 14:42:35 +02:00


420 Commits