hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Daniël de Kok	1c84a30fe6	MoE Marlin: support `desc_act` for `groupsize != -1` (#2590 ) This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.	2024-09-30 19:40:25 +02:00
drbh	93a7042d7e	feat: support phi3.5 moe (#2479 ) * feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-30 11:15:09 +02:00
Daniël de Kok	90a1d04a2f	Add support for GPTQ-quantized MoE models using MoE Marlin (#2557 ) This change adds support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported: - No `desc_act` with tensor parallelism, unless `group_size=-1`. - No asymmetric quantization. - No AWQ.	2024-09-30 11:14:32 +02:00
Mohit Sharma	f9e561eced	Update ROCM libs and improvements (#2579 ) * style * update torch * fix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env var * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error message * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile	2024-09-30 10:54:32 +02:00
Daniël de Kok	1028996fb3	flashinfer: pass window size and dtype (#2574 )	2024-09-28 18:41:41 +02:00
Daniël de Kok	5b6b74e21d	Improve support for GPUs with capability < 8 (#2575 ) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-09-27 16:19:42 +02:00
Nicolas Patry	dd8691b7c5	More tensor cores. (#2558 ) * More tensor cores. * Fixing the logic. * Gemma is modified by this.	2024-09-24 23:57:26 +02:00
Daniël de Kok	3f14cd1420	Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537 ) This replaces the custom layers in both models.	2024-09-24 14:27:06 +02:00
Daniël de Kok	c29dc89c18	Add support for scalar FP8 weight scales (#2550 ) * Add support for scalar FP8 weight scales * Support LLM compressor FP8 checkpoints on H100 On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype. However, we wouldn't pick up fp8 quantization for models quantized with LLM compressor. This change adds enough parsing to detect if models have FP8-quantized weights. * Remove stray debug print	2024-09-24 13:57:40 +02:00
Daniël de Kok	ce85efa968	Move to moe-kernels package and switch to common MoE layer (#2511 ) * Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner	2024-09-17 18:08:58 +02:00
Wang, Yi	3ac7df2b6d	hotfix : enable intel ipex cpu and xpu in python3.11 (#2517 ) enable intel ipex cpu and xpu in python3.11 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-12 17:23:49 +02:00
Wang, Yi	5cd8025f18	hotfix: fix regression of attention api change in intel platform (#2439 ) fix regression caused by attention api change. ipex.varlen_attention does not support paged-cache format kv input now. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-05 17:41:39 +02:00
Nicolas Patry	d9fbbaafb0	Tied embeddings in MLP speculator. (#2473 ) * Tied embeddings in MLP speculator. * Fixing the scale_weight when users decide to not use the speculation as much as defined in the config. * Adding scaling support + optimize some ops.	2024-08-29 17:44:54 +02:00
Nicolas Patry	e415b690a6	Lots of improvements (Still 2 allocators) (#2449 ) * Making prefix/flashinfer the default and testing the full release tests. * Include flashinfer in the docker. * Using prebuilt. * Allowing window_left_size (dummy version). * Disabling flashinfer/prefix caching on odd head_dim * Disable prefix caching for lora. * More specific codes. * Update lock * Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere. * Update cargo lock ? * Upgrade to 1.80 because of bitstream... * Everywhere 1.80 * Forgot last default place. * Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com> * Updated flake lock * Tmp * Upgrade resolution system for less errors in resolution. * Remove lambda for cleaner function. * Handling debugger. * Override the env in server tests. * Is this enough to make it work ? * This seems to be working. * Downgrade some logs. * Fixing the default for vlm. * Don't enable prefix caching on VLM just yet. * Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) * Fixing prefix caching for flashdecoding. * Update all models. * Fixed flashinfer version. * add_special_tokens is internal only * Fixing seqlen with the new vlms. * Fixing the issue with `add_special_tokens` not being passed around. * Fixing the test. * Removing encoder_decoder (seq2seq). * Update the chat test. * Fixing the batching tokenization in flash causal lm. * Truncating left for radix purposes. * Oops this doesn't belong here. * Put back default pure shell. * Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room * Only n_heads / process_group.size() are necessary. * Revert the integration tests change (seem linked to head_size modification). * Adding error message when assert is violated. * Fixing the free algorithm to handle times where the common prefix is smaller. * Apply suggestions from code review Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Update server/text_generation_server/layers/attention/common.py Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Fix disabling prefix caching - Fix windowing checks. * Revert the Cohere tokenizer change (for now using a revision instead). * Fmt. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-08-29 16:29:01 +02:00
Nicolas Patry	b70ae0969f	Prefix caching (#2402 ) * Prefix caching WIP * Fixing prefix attention. * Fixing flashinfer import. * Fixing black. * Fixing medusa (still wrong outputs, but functional). * Just medusa values now. * Fixing medusa without prefix caching. * Fixing prefix caching. * Medusa requires reshaping. * Removing the logs. * Remove router.nix * Fixup: - Remove logs - Disable VLMs (they do not work) - Disable prefix caching when user wants prefill logprobs. * Update flake.lock --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-08-20 11:15:30 +02:00
Nicolas Patry	f3b5c69441	Upgrading exl2. (#2415 ) * Upgrading exl2. * Fixing the other pathways. * Fix idefics.	2024-08-14 11:58:08 +02:00
drbh	1cebccc72b	fix: adds causal to attention params (#2408 ) fix: adds causal to attention params to check when using flash attn v1	2024-08-13 16:19:46 +02:00
drbh	8a7749b8fb	fix: include create_exllama_buffers and set_device for exllama (#2407 )	2024-08-12 17:59:37 -04:00
Nicolas Patry	84bc3d7b7d	Fixing import exl2 (#2399 )	2024-08-12 14:08:59 +02:00
Nicolas Patry	7a48a84784	Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backends (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
Daniël de Kok	7830de1566	Add FlashInfer support (#2354 ) This change adds support for FlashInfer. FlashInfer can be enabled using `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`. Since this functionality is currently only for testing, FlashInfer is not installed anywhere yet. The FlashInfer API is quite different from FlashAttention/vLLM in that it requires more global bookkeeping: * A wrapper class needs to be constructed (which we just call state). Since this is fairly expensive (due to pinned host memory allocation), we only do this once in a FlashCausalLM instance or for each CUDA Graph size. * Each model forward call needs to be wrapped in `begin_forward` and `end_forward`. This sets up data structures that can be reused for all calls to attention for that forward call. When calling attention, we need access to the state object. To avoid passing an argument down the call chain (which would require changes to all models), we use a context variable. Each model forward call is wrapped using a context manager that does all the bookkeeping for such a call: * Set the context variable to the forward call's state. * Call `begin_forward` on the state. * Yield. * Call `end_forward` on the state. * Reset the context variable. We cannot use a single shared global variable for this, since e.g. CUDA Graphs of different sizes each have their own state.	2024-08-09 11:42:00 +02:00
drbh	2ca5980634	Pr 2337 ci branch (#2379 ) * hotfix: fix xpu crash brought by code refine. torch.xpu rely on import ipex Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * reable gemma2 in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix in regression in ipex flashattention Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-08 12:30:29 -04:00
drbh	29b8d19cdf	fix: return the out tensor rather than the function's return value (#2361 )	2024-08-06 13:49:53 +02:00
drbh	215ed3ad52	fix: attempt forward on flash attn2 to check hardware support (#2335 ) * fix: attempt forward on flash attn2 to check hardware support * fix: warn window_size_left when using flash attn 1 * fix: prefer version check over test op and avoid window_size_left if not flash attn2 * fix: improve conditional and error message * fix: update sliding window conditional * fix: simplify changes and revert model changes * fix: avoid changing conditional * fix: typo tweak	2024-08-05 09:11:40 -04:00
Daniël de Kok	47447ef017	Unify attention output handling (#2343 ) - Always return the hidden states. - Create the output tensor inside the `attention` and `paged_attention` functions. This removes the difference between how the output is handled between attention (output parameter) and paged attention (return value). This also removes the assumption that the attention implementation can write to an output tensor (in preparation of FlashInfer).	2024-08-01 17:03:28 +02:00
Daniël de Kok	34f7dcfd80	Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300 ) The `GPTQWeightsLoader` was structured like this in pseudocode: if marlin: Set up tensors in a way that GPTQ-Marlin expects else: Set up tensors in a way that ExLlama/GPTQ/AWQ expect However, the GPTQ-Marlin implementation details should really be in the `marlin` module. So move the former part out to a separate `GPTQMarlinWeightsLoader`.	2024-07-31 13:08:41 +02:00
Daniël de Kok	53aec27328	server quantize: store quantizer config in standard format (#2299 ) - Create `quantization_config` option in the model config. - Don't store the quantizer config in tensors anymore.	2024-07-30 15:16:20 +02:00
Daniël de Kok	922732b255	Install Marlin from standalone package (#2320 )	2024-07-29 15:37:10 +02:00
drbh	bab02ff2bc	feat: add ruff and resolve issue (#2262 ) * feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check	2024-07-26 10:29:09 -04:00
Daniël de Kok	9256d7c38c	Some small fixes for the Torch 2.4.0 update (#2304 ) * Fix GPTQ autotune data type to be compatible with Torch 2.4.0 * Update poetry lock file * Fix small PaliGemma logprob differences after the torch update	2024-07-25 13:34:44 +02:00
drbh	5d85a958c9	fix: refactor adapter weight loading and mapping (#2193 ) * fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: improve adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo	2024-07-24 15:32:14 -04:00
Daniël de Kok	93d2b9fe9c	Split up `layers.marlin` into several files (#2292 ) The marlin.py file was getting large, split it up.	2024-07-24 16:33:26 +02:00
Daniël de Kok	4ab4173767	Add support for Llama 3 rotary embeddings (#2286 ) * Add support for Llama 3 rotary embeddings * Update transformers to 4.43	2024-07-23 17:18:54 +02:00
Daniël de Kok	9935720c87	Add support for repacking AWQ weights for GPTQ-Marlin (#2278 ) * Add support for repacking AWQ weights for GPTQ-Marlin So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now enabling AWQ using Marlin requires running TGI with `--quantize gptq`. * Enable Marlin for supported AWQ configurations by default This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.	2024-07-23 13:08:20 +02:00
OlivierDehaene	5fca30ee15	fix(l4): fix fp8 logic on l4 (#2277 ) * fix(l4): fix fp8 logic on l4 * also quant weights with single scale * use marlin even on 89	2024-07-23 11:24:29 +02:00
Nicolas Patry	6aeb669072	Softcapping for gemma2. (#2273 ) * Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.	2024-07-22 18:27:10 +02:00
OlivierDehaene	4844ff790a	fix(server): fix fp8 weight loading (#2268 ) * fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype	2024-07-22 15:51:32 +00:00
OlivierDehaene	53ec0b790b	feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248 ) * feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout	2024-07-20 19:02:04 +02:00
Daniël de Kok	e52be9bba2	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loading that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads of size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-07-19 17:23:20 +02:00
Daniël de Kok	ba291dad9f	Improve the handling of quantized weights (#2250 ) * Improve the handling of quantized weights Handling of quantized weights was split between two mechanisms: - For quantized checkpoints, we used the new weight loader infrastructure. - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on conditionals in `get_linear`. Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model, but unquantized vision and connector models. However, the context manager would be overridden by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits: - We can use context managers with all quantizers. - All the implementation details move down to the quantizer layers, `get_linear` does not need to know how to handle quantizer linear layers. - All quantizer weights are strongly typed, we don't pass around raw tensors. - We don't have to pass around the `quantizer` string everywhere. * Exclude non-MLP layers when using FP8 quantization with Llama	2024-07-19 09:37:39 +02:00
Daniël de Kok	dbb23fbfa8	Use symmetric quantization in the `quantize` subcommand (#2120 ) Packing of asymmetric quantization is broken, all (q)zeros values of `0` get reset to `1`, resulting in a loss of accuracy. So instead use symmetric quantization. To be able to distinguish models with symmetric and asymmetric quantization, a new config tensor `gptq_sym` is added. If this tensor is not present, we assume `sym=False`.	2024-07-12 12:20:12 +02:00
SeongBeomLEE	c46eaf707b	[fix] Modifying base in yarn embedding (#2212 )	2024-07-12 10:04:51 +02:00
Daniël de Kok	cb150eb295	Add support for FP8 on compute capability >=8.0, <8.9 (#2213 ) Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs with compute capability >=8.0 and <8.9. Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>	2024-07-11 16:03:26 +02:00
Daniël de Kok	8511669cb2	Move quantized weight handling out of the `Weights` class (#2194 ) Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.	2024-07-09 20:04:03 +02:00
Aaron Mihalik	c6bcadf883	Adding "longrope" for Phi-3 (#2172 ) (#2179 ) Adding "longrope" for phi-3	2024-07-05 09:46:41 +02:00
Nicolas Patry	dea9c0dc74	Fixing rocm. (#2164 )	2024-07-02 12:01:08 +02:00
Wang, Yi	5d97e0c4a3	fix FlashDecoding change's regression in intel platform (#2161 ) install triton because GPTQParams needs it. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-02 11:56:07 +02:00
Nicolas Patry	4327210e6b	[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940 ) * Using flash decoding Conditional flashdecoding. Fix max_q. Working kvcache Working version with flash decoding. Make it work for mistral. Fix after rebase.. Less intrusive. REvert changes in modeling. Speedup flashdecoding. HHachweew Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs Not sending preallocated output. * Making it work on non flash decoding. * Fix Cohere. * Fix non decoding paths. * Rebased. * No need for cache_manager anymore. * Update? * "ipex" -> "cpu" * These do not belong. * Factoring cu_seqlen_qk for better abstracting over every model. * Fixing non flash tests/imports. * Changing return everywhere. * Update mistral past. * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though). * Fixup mistral clamping (had issues with cuda graphs). * No need to recreate anything actually.	2024-07-01 23:28:00 +02:00
Wang, Yi	5da4cfab1c	refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132 ) * refine get xpu free memory Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable qwen2 in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable gemma/gemma2/phi in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-01 14:32:54 +02:00
Daniël de Kok	2ce8019480	Use GPTQ-Marlin for supported GPTQ configurations (#2111 ) GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So let's use it by default if the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`. This subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model as symmetric would select GPTQ-Marlin (which does not support asymmetric quantization).	2024-07-01 12:59:12 +02:00


76 Commits