hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
drbh	f852190060	fix: prefer hidden_activation over hidden_act in gemma2 (#2381 )	2024-08-08 14:08:56 -04:00
Wang, Yi	689b1abbf6	fix EleutherAI/gpt-neox-20b does not work in tgi (#2346 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-08 12:08:52 -04:00
drbh	a379d5536b	Fix the prefix for OPT model in opt_modelling.py #2370 (CI RUN) (#2371 ) * Fix the bug * fix: run lints * fix: small syntax tweak --------- Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>	2024-08-07 23:14:02 -04:00
drbh	21267f3ca3	add gptj modeling in TGI #2366 (CI RUN) (#2372 ) * add gptj modeling Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-07 21:32:37 -04:00
almersawi	8094ecfc9e	fix: fix num_ln_in_parallel_attn attribute name typo in RWConfig (#2350 ) Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>	2024-08-07 19:45:23 -04:00
drbh	133015f408	fix: prefer original layernorm names for 180B (#2365 )	2024-08-06 15:25:30 -04:00
drbh	a64d407d64	fix: default num_ln_in_parallel_attn to one if not supplied (#2364 )	2024-08-06 13:33:22 -04:00
Daniël de Kok	47447ef017	Unify attention output handling (#2343 ) - Always return the hidden states. - Create the output tensor inside the `attention` and `paged_attention` functions. This removes the difference between how the output is handled between attention (output parameter) and paged attention (return value). This also removes the assumption that the attention implementation can write to an output tensor (in preparation of FlashInfer).	2024-08-01 17:03:28 +02:00
Wang, Yi	9ab9937414	enable HuggingFaceM4/idefics-9b in intel gpu (#2338 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-01 11:08:36 +02:00
drbh	bab02ff2bc	feat: add ruff and resolve issue (#2262 ) * feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check	2024-07-26 10:29:09 -04:00
Daniël de Kok	4b49c50f4c	Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313 )	2024-07-26 14:57:24 +02:00
Wang, Yi	8642250602	fix of use of unquantized weights in cohere GQA loading, also enable … (#2291 ) fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:44:02 +02:00
Wang, Yi	5ad39dd3c3	fix crash in multi-modal (#2245 ) * fix crash in multi-modal Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update according to review comment Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix llava_next regression in latest main Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:39:08 +02:00
shaltielshmid	3961e32390	[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254 ) * Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-23 15:00:07 +02:00
Nicolas Patry	abc32537ea	Fixing mistral nemo. (#2276 )	2024-07-23 11:16:03 +02:00
Nicolas Patry	6aeb669072	Softcapping for gemma2. (#2273 ) * Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.	2024-07-22 18:27:10 +02:00
icyboy™	4e4207224e	Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug * Hotfix: fix of use of unquantized weights in Mixtral GQA loading	2024-07-22 11:31:00 +02:00
OlivierDehaene	f3435bab8c	fix(server): fix deepseekv2 loading (#2266 )	2024-07-21 18:48:04 +02:00
OlivierDehaene	53ec0b790b	feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248 ) * feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout	2024-07-20 19:02:04 +02:00
Daniël de Kok	e52be9bba2	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loads that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads with size 192, needs an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-07-19 17:23:20 +02:00
Daniël de Kok	3b41e93a09	Hotfix: fix MPT after recent refactor (#2257 )	2024-07-19 14:42:35 +02:00
Daniël de Kok	18db78f295	Hotfix: various GPT-based model fixes (#2256 )	2024-07-19 14:42:19 +02:00
Daniël de Kok	80adb5be16	Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255 )	2024-07-19 12:55:59 +02:00
Daniël de Kok	ba291dad9f	Improve the handling of quantized weights (#2250 ) * Improve the handling of quantized weights Handling of quantized weights was split between two mechanisms: - For quantized checkpoints, we used the new weight loader infrastructure. - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on conditional in `get_linear`. Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model, but unquantized vision and connector models. However, the context manager would be overrided by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits: - We can use context managers with all quantizers. - All the implementation details move down to the quantizer layers, `get_linear` does not need to know how to handle quantizer linear layers. - All quantizer weights are strongly typed, we don't pass around raw tensors. - We don't have to pass around the `quantizer` string everywhere. * Exclude non-MLP layers when using FP8 quantization with Llama	2024-07-19 09:37:39 +02:00
OlivierDehaene	1d1b1efa01	fix(server): fix cohere (#2249 )	2024-07-18 16:00:13 +02:00
Daniël de Kok	06d0e880e0	Add support for AWQ-quantized Idefics2 (#2233 ) Fixes #2036.	2024-07-16 07:58:25 +02:00
Daniël de Kok	8511669cb2	Move quantized weight handling out of the `Weights` class (#2194 ) Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow.	2024-07-09 20:04:03 +02:00
Daniël de Kok	5c7c9f1390	Falcon/DBRX: get correct number of key-value heads (#2205 )	2024-07-08 13:22:38 +02:00
icyboy™	521d0d990f	fix dbrx & opt model prefix bug (#2201 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug	2024-07-08 09:01:14 +02:00
Daniël de Kok	05c094fcfa	Consistently take `prefix` in model constructors (#2191 ) * Consistently take `prefix` in model constructors * Release test check fix * Misc refactor-related fixes	2024-07-05 16:07:48 +02:00
Daniël de Kok	b67d46336e	Fix Starcoder2 after refactor (#2189 )	2024-07-05 12:22:45 +02:00
Nicolas Patry	853d4eb9cf	Hotfixing after refactor.	2024-07-05 09:25:29 +00:00
Nicolas Patry	fb2f74e2b9	Refactor dead code - Removing all `flash_xxx.py` files. (#2166 ) * Refactor dead code. * First working step. * Remove a lot of duplicated code. * More dead code. * More cleanup. * Fix Santacoder test. * Fixing the simple tests. * Fixing sharding. * Fixes for VLM. * Fixing santacoder (num_kv_heads hardcoded). * Removing more dead code. * Fixing `config.n_head`. * Stopping earlier because of `<end_of_utterance>` in idefics2. * Addresses comments. * Removing the dead code. * Fuse back mistral into FlashCausalLM. * Finish removal. * Fixing docs + causal_lm `batch_class`. * Fixing docs + causal.lm. * Add default to Gemma Causality. * Default value for gemma/gemma2. * Wrong default.	2024-07-05 10:29:56 +02:00
Nicolas Patry	0759ec495e	Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167 )	2024-07-02 14:26:47 +02:00
drbh	b966bc0d35	fix: use the base layers weight in mistral rocm (#2155 )	2024-07-02 11:56:25 +02:00
Nicolas Patry	4327210e6b	[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940 ) * Using flash decoding Conditional flashdecoding. Fix max_q. Working kvcache Working version with flash decoding. Make it work for mistral. Fix after rebase.. Less intrusive. REvert changes in modeling. Speedup flashdecoding. HHachweew Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs Not sending preallocated output. * Making it work on non flash decoding. * Fix Cohere. * Fix non decoding paths. * Rebased. * No need for cache_manager anymore. * Update? * "ipex" -> "cpu" * These do not belong. * Factoring cu_seqlen_qk for better abstracting over every model. * Fixing non flash tests/imports. * Changing return everywhere. * Update mistral past. * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though). * Fixup mistral clamping (had issues with cuda graphs). * No need to recreate anything actually.	2024-07-01 23:28:00 +02:00
Nicolas Patry	4f55f15840	Fixing baichuan override. (#2158 )	2024-07-01 23:25:54 +02:00
drbh	25f57e2e98	fix: use weights from base_layer (#2141 )	2024-07-01 12:58:40 +02:00
Nicolas Patry	3ea8259af1	Fixing gemma2. (#2135 ) * Fixing gemma2. * Adding new model.	2024-06-27 16:04:20 +02:00
Daniël de Kok	dd2d91b043	Idefics2: sync added image tokens with transformers (#2080 ) Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.	2024-06-27 15:54:35 +02:00
drbh	04e1af94d7	Enable multiple LoRa adapters (#2010 ) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: perfer loraxs custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-06-25 14:46:27 -04:00
Nicolas Patry	9e2fdf57c0	Removing IPEX_AVAIL. (#2115 ) * Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most code is exactly similar except for a very few spots. The biggest number of spots is the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them. * Forgot a few places. * Unrelated change. * Fixing HF_TOKEN. * HF_TOKEN	2024-06-25 13:20:57 +02:00
Wang, Yi	b64c70c9e7	Cpu tgi (#1936 ) * add CPU tgi support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * ipex distributed ops support Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>	2024-06-25 12:21:29 +02:00
Daniël de Kok	f5a9837592	Support exl2-quantized Qwen2 models (#2085 ) Fixes #2081.	2024-06-20 07:56:16 +02:00
Daniël de Kok	093a27c528	Add support for GPTQ Marlin (#2052 ) Add support for GPTQ Marlin kernels GPTQ Marlin extends the Marlin kernels to support common GPTQ configurations: - bits: 4 or 8 - groupsize: -1, 32, 64, or 128 - desc_act: true/false Using the GPTQ Marlin kernels requires repacking the parameters in the Marlin quantizer format. The kernels were contributed by Neural Magic to VLLM. We vendor them here for convenience.	2024-06-14 09:45:42 +02:00
OlivierDehaene	521de6cacd	fix(server): fix OPT implementation (#2061 )	2024-06-12 18:22:20 +02:00
Daniël de Kok	85dfc39222	Add Phi-3 medium support (#2039 ) Add support for Phi-3-medium The main difference between the medium and mini models is that medium uses grouped query attention with a packed QKV matrix. This change adds support for GQA with packed matrixes to `Weights.get_weights_col_packed` and uses it for Phi-3. This also allows us to remove the custom implementation of GQA from dbrx attention loading.	2024-06-10 09:22:29 +02:00
Daniël de Kok	4594e6faba	Add support for Marlin-quantized models This change adds support for Marlin-quantized models. Marlin is an FP16xINT4 matmul kernel, which provides good speedups decoding batches of 16-32 tokens. It supports quantized models with symmetric quantization, groupsize -1 or 128, and 4-bit. Tested with: - Llama 2 - Llama 3 - Phi 3	2024-06-06 13:16:52 +02:00
OlivierDehaene	8aece3bd68	feat: move allocation logic to rust (#1835 ) Close #2007	2024-06-05 12:18:38 +02:00
Daniël de Kok	9b52f0e2dc	Fix Phi-2 with `tp>1` (#2003 ) We were using the wrong parallelism in the up-projection.	2024-06-04 14:26:07 +02:00

161 Commits
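
Note on commit 8511669cb2: its message describes `Weights` keeping the low-level load operations while delegating quantizer-specific processing to a `WeightsLoader` implementation, which in turn still calls back into `Weights`. A minimal sketch of that delegation pattern follows. It is not TGI's actual code: the method names and signatures, `DefaultWeightsLoader`, and `ToyQuantizedLoader` are hypothetical, and plain Python lists stand in for tensors.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class WeightsLoader(ABC):
    """Simplified loader interface (hypothetical signature, not TGI's)."""

    @abstractmethod
    def get_weights(self, weights: "Weights", prefix: str) -> Any:
        """Return the weights stored under `prefix` in loader-specific form."""


class DefaultWeightsLoader(WeightsLoader):
    """Hypothetical loader for unquantized checkpoints: one tensor per layer."""

    def get_weights(self, weights: "Weights", prefix: str) -> Any:
        return weights.get_tensor(f"{prefix}.weight")


class ToyQuantizedLoader(WeightsLoader):
    """Hypothetical loader for a made-up quantized layout (qweight + scales)."""

    def get_weights(self, weights: "Weights", prefix: str) -> Any:
        # The loader reuses the low-level operations provided by `Weights`.
        return {
            "qweight": weights.get_tensor(f"{prefix}.qweight"),
            "scales": weights.get_tensor(f"{prefix}.scales"),
        }


class Weights:
    """Low-level tensor access; quantizer-specific work goes to the loader."""

    def __init__(self, tensors: Dict[str, Any], loader: WeightsLoader):
        self._tensors = tensors
        self.loader = loader

    def get_tensor(self, name: str) -> Any:
        return self._tensors[name]

    def get_weights(self, prefix: str) -> Any:
        # No per-quantizer conditional here: the loader decides what to read.
        return self.loader.get_weights(self, prefix)


# Toy checkpoint: lists stand in for tensors.
store = {"fc.qweight": [1, 2], "fc.scales": [0.5], "norm.weight": [1.0]}
quantized = Weights(store, ToyQuantizedLoader())
plain = Weights(store, DefaultWeightsLoader())
fc_weights = quantized.get_weights("fc")
norm_weight = plain.get_weights("norm")
```

Composition is used here rather than a `GPTQWeights`-style subclass of `Weights`, matching the commit message's remark that the inheritance approach was tried and found inflexible.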