278 Commits

Author SHA1 Message Date
drbh bab02ff2bc
feat: add ruff and resolve issue (#2262)
* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check
2024-07-26 10:29:09 -04:00
Daniël de Kok 4b49c50f4c
Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313) 2024-07-26 14:57:24 +02:00
drbh 5d85a958c9
fix: refactor adapter weight loading and mapping (#2193)
* fix: refactor adapter weight loading and mapping

* feat: enable lora load from directory

* fix: adjust launcher for local lora adapters

* feat: improve weight loading and add tests

* fix: improve logging and rebase syntax issue

* fix: improve adapter merge comments and remove unused conditional

* fix: improve get_model_with_lora_adapters naming

* fix: comment typo
2024-07-24 15:32:14 -04:00
Wang, Yi 8642250602
fix of use of unquantized weights in cohere GQA loading, also enable … (#2291)
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-24 10:44:02 +02:00
Wang, Yi 5ad39dd3c3
fix crash in multi-modal (#2245)
* fix crash in multi-modal

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update according to review comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-24 10:39:08 +02:00
shaltielshmid 3961e32390
[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254)
* Support passing head_dim through config

* Using `head_dim` as a fallback is necessary since it's a non standard
key in mistralConfig (as defined in transformers).

* Shorter diff.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-07-23 15:00:07 +02:00
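For the `head_dim` change in #2254 above, a minimal sketch of reading an optional `head_dim` from a config dictionary and otherwise deriving it from the hidden size. The function name `resolve_head_dim` and the plain-dict config are assumptions for illustration, not the code from the PR.

```python
def resolve_head_dim(config: dict) -> int:
    """Return the attention head size for a model config.

    Uses the explicit `head_dim` key when the config provides one,
    otherwise derives it as hidden_size // num_attention_heads.
    """
    head_dim = config.get("head_dim")
    if head_dim is not None:
        return head_dim
    return config["hidden_size"] // config["num_attention_heads"]


# Hypothetical config values, chosen only to exercise both branches.
derived = resolve_head_dim({"hidden_size": 4096, "num_attention_heads": 32})
explicit = resolve_head_dim(
    {"hidden_size": 5120, "num_attention_heads": 32, "head_dim": 128}
)
```

The explicit key matters when the head size is not simply the hidden size divided by the number of heads, as in the second call.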
Nicolas Patry abc32537ea
Fixing mistral nemo. (#2276) 2024-07-23 11:16:03 +02:00
Nicolas Patry 6aeb669072
Softcapping for gemma2. (#2273)
* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.
2024-07-22 18:27:10 +02:00
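For the softcapping commit (#2273) above, a minimal sketch of tanh-based logit softcapping, with 0.0 treated as "no cap" in line with the note about the null value. The function name `softcap` and the list-of-floats input are assumptions for illustration; the real kernels operate on tensors.

```python
import math


def softcap(scores, cap):
    """Squash each score into [-cap, cap] with tanh.

    A cap of 0.0 is treated as the null value and disables capping.
    """
    if cap == 0.0:
        return list(scores)
    return [cap * math.tanh(s / cap) for s in scores]


capped = softcap([1000.0, -1000.0, 1.0], 50.0)
uncapped = softcap([1000.0, -1000.0, 1.0], 0.0)
```

Small scores pass through almost unchanged, while large ones saturate near the cap.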
OlivierDehaene 4844ff790a
fix(server): fix fp8 weight loading (#2268)
* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype
2024-07-22 15:51:32 +00:00
icyboy™ 4e4207224e
Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading
2024-07-22 11:31:00 +02:00
OlivierDehaene f3435bab8c
fix(server): fix deepseekv2 loading (#2266) 2024-07-21 18:48:04 +02:00
OlivierDehaene 53ec0b790b
feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248)
* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout
2024-07-20 19:02:04 +02:00
Daniël de Kok e52be9bba2
Add support for Deepseek V2 (#2224)
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
  So, we need weight loading that supports quantized weights. To this
  end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads of size 192 need an extension to our paged attention
  fork, and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.
2024-07-19 17:23:20 +02:00
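The "grouped top-K" expert selection mentioned in the Deepseek V2 commit (#2224) above can be sketched for a single token as follows. Ranking each group by its best expert score, and all names and parameters here, are assumptions for illustration rather than the implementation from the PR.

```python
def grouped_top_k(scores, n_groups, top_k_groups, top_k):
    """Pick expert indices for one token with group-limited top-k.

    Experts are split into equal contiguous groups. Only experts in the
    best `top_k_groups` groups are eligible, and the `top_k` highest
    scoring eligible experts are returned, best first.
    """
    group_size = len(scores) // n_groups
    # Rank groups by the score of their best expert (an assumption).
    group_scores = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_groups)
    ]
    best_groups = sorted(
        range(n_groups), key=lambda g: group_scores[g], reverse=True
    )[:top_k_groups]
    # Only experts inside the selected groups remain candidates.
    candidates = [
        i for g in best_groups for i in range(g * group_size, (g + 1) * group_size)
    ]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]


router_scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.05, 0.3]
selected = grouped_top_k(router_scores, n_groups=4, top_k_groups=2, top_k=3)
```

With these scores, expert 4 (0.7) would be chosen by a plain top-3, but it is excluded here because its group is not among the two best groups.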
Daniël de Kok 3f37a66774
Hotfix: pass through model revision in `VlmCausalLM` (#2258) 2024-07-19 15:59:00 +02:00
Daniël de Kok 3b41e93a09
Hotfix: fix MPT after recent refactor (#2257) 2024-07-19 14:42:35 +02:00
Daniël de Kok 18db78f295
Hotfix: various GPT-based model fixes (#2256) 2024-07-19 14:42:19 +02:00
Daniël de Kok 80adb5be16
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) 2024-07-19 12:55:59 +02:00
Daniël de Kok ba291dad9f
Improve the handling of quantized weights (#2250)
* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditionals in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overridden by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama
2024-07-19 09:37:39 +02:00
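The context-manager pattern described in #2250 above, where some layers are loaded with a different weight loader, might look roughly like this. The class and method names (`use_loader`, `QuantizedLoader`, `UnquantizedLoader`) are hypothetical stand-ins, and strings take the place of real tensors.

```python
from contextlib import contextmanager


class QuantizedLoader:
    # Hypothetical loader standing in for a quantizer-specific loader.
    def get_weights(self, weights, prefix):
        return f"quantized:{prefix}"


class UnquantizedLoader:
    # Hypothetical loader that returns plain, unquantized weights.
    def get_weights(self, weights, prefix):
        return f"unquantized:{prefix}"


class Weights:
    def __init__(self, loader):
        self.loader = loader

    @contextmanager
    def use_loader(self, loader):
        """Temporarily swap the active loader, restoring it on exit."""
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous

    def get_weights(self, prefix):
        return self.loader.get_weights(self, prefix)


weights = Weights(QuantizedLoader())
text = weights.get_weights("text_model.q_proj")
with weights.use_loader(UnquantizedLoader()):
    vision = weights.get_weights("vision_model.proj")
after = weights.get_weights("text_model.k_proj")
```

Because the choice of loader lives in one place, no caller needs to inspect a `quantizer` string to decide how a layer is loaded.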
OlivierDehaene 1d1b1efa01
fix(server): fix cohere (#2249) 2024-07-18 16:00:13 +02:00
Daniël de Kok 06d0e880e0
Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-07-16 07:58:25 +02:00
Daniël de Kok 8511669cb2
Move quantized weight handling out of the `Weights` class (#2194)
Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy, where every higher level method to load weights
was a long conditional to cover all the different quantizers.

This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.

I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
2024-07-09 20:04:03 +02:00
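The delegation described in #2194 above, where `Weights` keeps the low-level load operations and hands quantizer-specific work to a loader, can be sketched as below. Only the `WeightsLoader` and `Weights` names come from the commit message; the method signatures, `PlainLoader`, `ToyQuantLoader`, and the dict-backed storage are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class WeightsLoader(ABC):
    """Interface for quantizer-specific weight loading."""

    @abstractmethod
    def get_weights(self, weights, prefix):
        ...


class Weights:
    """Provides low-level tensor access and delegates the rest to a loader."""

    def __init__(self, tensors, loader):
        self._tensors = tensors  # stand-in for checkpoint files
        self.loader = loader

    def get_tensor(self, name):
        return self._tensors[name]

    def get_weights(self, prefix):
        # Higher-level loads are delegated rather than branching on a quantizer.
        return self.loader.get_weights(self, prefix)


class PlainLoader(WeightsLoader):
    def get_weights(self, weights, prefix):
        return weights.get_tensor(f"{prefix}.weight")


class ToyQuantLoader(WeightsLoader):
    # A toy quantized layout: packed weights plus scales.
    def get_weights(self, weights, prefix):
        return {
            "qweight": weights.get_tensor(f"{prefix}.qweight"),
            "scales": weights.get_tensor(f"{prefix}.scales"),
        }


store = {"fc.weight": [1.0, 2.0], "fc.qweight": [17], "fc.scales": [0.5]}
plain = Weights(store, PlainLoader()).get_weights("fc")
quant = Weights(store, ToyQuantLoader()).get_weights("fc")
```

Composition keeps the loaders swappable, which also makes it easy to back `Weights` with a mock store in tests.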
Daniël de Kok 5c7c9f1390
Falcon/DBRX: get correct number of key-value heads (#2205) 2024-07-08 13:22:38 +02:00
Daniël de Kok 153fcf7739
Fix incorrect cache allocation with multi-query (#2203)
We wouldn't allocate any memory in multi-query (1 KV head). Fixes
Starcoder et al.
2024-07-08 11:19:48 +02:00
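The commit message for #2203 above does not spell out the cause, but one way a multi-query model can end up with an empty cache is integer division of a single KV head across several shards. The sketch below illustrates that failure mode under this assumption; the function names and the cache-size formula are hypothetical.

```python
def kv_heads_per_shard(num_kv_heads, world_size):
    """Number of KV heads each shard should hold.

    With multi-query attention there is a single KV head, which is
    replicated on every shard instead of being divided.
    """
    if num_kv_heads == 1:
        return 1
    return num_kv_heads // world_size


def cache_elements(num_blocks, block_size, kv_heads, head_size):
    # Element count of one key (or value) cache for a single layer.
    return num_blocks * block_size * kv_heads * head_size


naive_heads = 1 // 2  # plain division yields 0 heads on 2 shards
naive_size = cache_elements(100, 16, naive_heads, 128)
fixed_size = cache_elements(100, 16, kv_heads_per_shard(1, 2), 128)
```

In the naive case the cache has zero elements, which matches the symptom of no memory being allocated.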
Daniël de Kok cce475a949
hotfix: Fix number of KV heads (#2202)
Fix number of KV heads
2024-07-08 09:52:12 +02:00
icyboy™ 521d0d990f
fix dbrx & opt model prefix bug (#2201)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug
2024-07-08 09:01:14 +02:00
Daniël de Kok 05c094fcfa
Consistently take `prefix` in model constructors (#2191)
* Consistently take `prefix` in model constructors

* Release test check fix

* Misc refactor-related fixes
2024-07-05 16:07:48 +02:00
Daniël de Kok b67d46336e
Fix Starcoder2 after refactor (#2189) 2024-07-05 12:22:45 +02:00
Nicolas Patry 853d4eb9cf
Hotfixing after refactor. 2024-07-05 09:25:29 +00:00
Nicolas Patry fb2f74e2b9
Refactor dead code - Removing all `flash_xxx.py` files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-07-05 10:29:56 +02:00
Nicolas Patry 0759ec495e
Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167) 2024-07-02 14:26:47 +02:00
drbh b966bc0d35
fix: use the base layers weight in mistral rocm (#2155) 2024-07-02 11:56:25 +02:00
Nicolas Patry 022f6515a4
Fixing graph capture for flash decoding. (#2163) 2024-07-02 11:43:07 +02:00
Nicolas Patry 4327210e6b
[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940)
* Using flash decoding

Conditional flashdecoding.

Fix max_q.

Working kvcache

Working version with flash decoding.

Make it work for mistral.

Fix after rebase..

Less intrusive.

Revert changes in modeling.

Speedup flashdecoding.

Hack to make other models work.

Fixing non flash decoding llama path.

Router logic knows about page size.

Missing 2 models.

Missing cohere.

Fixing cohere flash decoding.

Revamped all this architecture.

Fix cohere.

Fixing falcon.

Enabling custom block size schedule.

Update router/src/infer.rs

Not sending preallocated output.

* Making it work on non flash decoding.

* Fix Cohere.

* Fix non decoding paths.

* Rebased.

* No need for cache_manager anymore.

* Update?

* "ipex" -> "cpu"

* These do not belong.

* Factoring cu_seqlen_qk for better abstracting over every model.

* Fixing non flash tests/imports.

* Changing return everywhere.

* Update mistral past.

* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).

* Fixup mistral clamping (had issues with cuda graphs).

* No need to recreate anything actually.
2024-07-01 23:28:00 +02:00
Nicolas Patry 4f55f15840
Fixing baichuan override. (#2158) 2024-07-01 23:25:54 +02:00
Wang, Yi 5da4cfab1c
refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132)
* refine get xpu free memory

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable qwen2 in xpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable gemma/gemma2/phi in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-01 14:32:54 +02:00
icyboy™ 9d0ca503a8
fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123)
https://github.com/huggingface/text-generation-inference/issues/2122
2024-07-01 14:17:22 +02:00
drbh 25f57e2e98
fix: use weights from base_layer (#2141) 2024-07-01 12:58:40 +02:00
Nicolas Patry 3ea8259af1
Fixing gemma2. (#2135)
* Fixing gemma2.

* Adding new model.
2024-06-27 16:04:20 +02:00
Daniël de Kok dd2d91b043
Idefics2: sync added image tokens with transformers (#2080)
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.
2024-06-27 15:54:35 +02:00
drbh 04e1af94d7
Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-06-25 14:46:27 -04:00
Wang, Yi e563983d90
fix cpu and xpu issue (#2116)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-06-25 16:47:06 +02:00
Nicolas Patry 9e2fdf57c0
Removing IPEX_AVAIL. (#2115)
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most of the code is identical,
except for a few spots.

Most of those spots are in the kv-cache layout and the flash_xxx.py
files.
Since those files should be removed soon and factored away, we should
not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
2024-06-25 13:20:57 +02:00
Wang, Yi b64c70c9e7
Cpu tgi (#1936)
* add CPU tgi support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2024-06-25 12:21:29 +02:00
drbh 811a9381b1
feat: sort cuda graphs in descending order (#2104) 2024-06-21 14:28:26 -04:00
Daniël de Kok f5a9837592
Support exl2-quantized Qwen2 models (#2085)
Fixes #2081.
2024-06-20 07:56:16 +02:00
Daniël de Kok e903770897
Support different image sizes in prefill in VLMs (#2065)
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-06-17 10:49:41 +02:00
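To illustrate why batching matters in #2065 above: images of different sizes cannot be stacked directly, whereas a step that sees the whole batch can bring them to a common size first. The sketch below uses zero-padding on nested lists as a stand-in for whatever the real image processor does; `pad_batch` is a hypothetical helper.

```python
def pad_batch(images, fill=0):
    """Pad 2D images (lists of rows) to the largest height and width in the batch."""
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded = []
    for img in images:
        # Extend each row to the common width, then add rows to the common height.
        rows = [row + [fill] * (max_w - len(row)) for row in img]
        rows += [[fill] * max_w for _ in range(max_h - len(rows))]
        padded.append(rows)
    return padded


small = [[1, 2], [3, 4]]         # 2 x 2
wide = [[5, 6, 7]]               # 1 x 3
batch = pad_batch([small, wide])
shapes = [(len(img), len(img[0])) for img in batch]
```

Processing each image on its own would keep the original 2 x 2 and 1 x 3 shapes, which is the situation in which concatenation can fail.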
Tiezhen WANG 96b7b40ca3
Update the link for qwen2 (#2068)
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-06-14 11:59:33 +02:00
Daniël de Kok 093a27c528
Add support for GPTQ Marlin (#2052)
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.
2024-06-14 09:45:42 +02:00
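The supported GPTQ configurations listed in #2052 above can be expressed as a small check. The function name `can_use_gptq_marlin` is hypothetical, and any further conditions the real code applies (hardware capability, for instance) are not modelled here.

```python
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


def can_use_gptq_marlin(bits, groupsize, desc_act):
    """Return True when a GPTQ config falls within the listed combinations.

    `desc_act` may be either True or False, so it only needs to be a bool.
    """
    return (
        bits in SUPPORTED_BITS
        and groupsize in SUPPORTED_GROUPSIZES
        and isinstance(desc_act, bool)
    )


ok = can_use_gptq_marlin(bits=4, groupsize=128, desc_act=True)
unsupported_bits = can_use_gptq_marlin(bits=3, groupsize=128, desc_act=False)
unsupported_group = can_use_gptq_marlin(bits=8, groupsize=16, desc_act=False)
```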
OlivierDehaene 90184df79c
fix(layers): fix SuRotaryEmbedding (#2060)
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
2024-06-12 18:24:47 +02:00
OlivierDehaene 521de6cacd
fix(server): fix OPT implementation (#2061) 2024-06-12 18:22:20 +02:00