Commit Graph

478 Commits

Author SHA1 Message Date
drbh a7556ba800 fix: refactors and helpful comments 2024-06-24 13:39:56 +00:00
drbh a07b612989 fix: revert skips and prefer updated ci token for tests 2024-06-19 17:31:13 +00:00
drbh c9e4526b9d fix: skip llama test CI (temp) 2 2024-06-19 17:19:40 +00:00
drbh 4f1543d3c7 fix: refactors and adjust flash llama lora logic 2024-06-19 16:13:42 +00:00
drbh 224455f389 Merge branch 'main' into lora-internal 2024-06-18 09:50:41 -04:00
Daniël de Kok c8c7ccd31e
Set maximum grpc message receive size to 2GiB (#2075)
* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which doesn't really work well for
multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
2024-06-17 16:40:44 +02:00
Daniël de Kok e903770897
Support different image sizes in prefill in VLMs (#2065)
When a batch contained images if different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-06-17 10:49:41 +02:00
drbh 1104885f00
Merge branch 'main' into lora-internal 2024-06-14 10:06:15 -04:00
drbh 0e1c28cafd fix: merge 'main' into lora-internal to resolve conflicts 2024-06-14 14:02:33 +00:00
Tiezhen WANG 96b7b40ca3
Update the link for qwen2 (#2068)
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-06-14 11:59:33 +02:00
Daniël de Kok 093a27c528
Add support for GPTQ Marlin (#2052)
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.
2024-06-14 09:45:42 +02:00
drbh aa88c4fd3a fix: add lora kernel to dockerfile, support running without kernels and refactors 2024-06-14 00:35:07 +00:00
OlivierDehaene 90184df79c
fix(layers): fix SuRotaryEmbedding (#2060)
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
2024-06-12 18:24:47 +02:00
OlivierDehaene 521de6cacd
fix(server): fix OPT implementation (#2061) 2024-06-12 18:22:20 +02:00
fxmarty a6e4d63c86
Update LLMM1 bound (#2050)
update commit
2024-06-11 19:30:29 +08:00
drbh ce40ad26fd fix: add model_id to IdeficsCausalLM 2024-06-10 10:24:21 -04:00
drbh 101b95adc4 fix: update all models forwards to include adapter_data 2024-06-10 10:24:21 -04:00
drbh 1deb372564 fix: add adapter_data param to phi and neox 2024-06-10 10:24:21 -04:00
drbh b1169273fd fix: add adapter_data param and avoid missing layers 2024-06-10 10:24:21 -04:00
drbh 91f407226d feat: support if vlm models 2024-06-10 10:24:21 -04:00
drbh 611225f017 feat: support base model generation and refactors 2024-06-10 10:24:21 -04:00
drbh 68399c1ae3 feat: prefer model id in request 2024-06-10 10:23:52 -04:00
drbh de56a81c5c feat: add lora support to mistral and refactors 2024-06-10 10:23:52 -04:00
drbh 9c45d34983 fix: add model_id to model test 2024-06-10 10:23:52 -04:00
drbh dc0f76553c fix: pass model_id for all causal and seq2seq lms 2024-06-10 10:23:52 -04:00
drbh 88bd5c2c92 fix: pass model_id for all flash causal lms 2024-06-10 10:23:52 -04:00
drbh 73eb2ae255 fix: refactor and move changes to v3 proto 2024-06-10 10:23:52 -04:00
drbh c927376725 fix: adjust adapter_segments logic when in batch 2024-06-10 10:23:52 -04:00
drbh ad088d51fa fix: adjust batch for bgmv 2024-06-10 10:23:52 -04:00
drbh 8984ce6c69 feat: perfer loraxs custom punica kernels and add mlp loras 2024-06-10 10:23:52 -04:00
drbh d5f21d57d1 fix: prefer adapter_data and refactors 2024-06-10 10:23:52 -04:00
drbh 8b50f4b779 feat: prefer lorax implementation and port loading logic 2024-06-10 10:23:52 -04:00
drbh c661631225 feat: baseline impl single request multi lora support 2024-06-10 10:23:52 -04:00
drbh a046c303f7 fix: refactor and reduce lora math 2024-06-10 10:23:52 -04:00
drbh 0a6ea7fb57 feat: load weights within layer and refactor lora pass 2024-06-10 10:23:52 -04:00
drbh db3d8e6518 feat: first draft load multiple lora 2024-06-10 10:23:52 -04:00
Daniël de Kok 85dfc39222
Add Phi-3 medium support (#2039)
Add support for Phi-3-medium

The main difference between the medium and mini models is that medium
uses grouped query attention with a packed QKV matrix. This change adds
support for GQA with packed matrixes to `Weights.get_weights_col_packed`
and uses it for Phi-3. This also allows us to remove the custom
implementation of GQA from dbrx attention loading.
2024-06-10 09:22:29 +02:00
fxmarty 9b3674d903
ROCm and sliding windows fixes (#2033)
* update vllm commit & fix models using sliding window

* update

* update commit

* fix bug where tunableop is bound to cuda graph even when cuda graph are disabled

* enable tunableop by default

* fix sliding window

* address review

* dead code

* precise comment

* is it flaky?
2024-06-10 15:09:50 +08:00
Daniël de Kok bf3c813782 server: use chunked inputs
The router will now send the input as chunks besides as a single
string. This change modifies the server to process chunked input
rather than strings. This also allows us to remove the image
extraction code from the server.
2024-06-07 08:09:04 +02:00
Daniël de Kok 51621439a4 marlin: improve build 2024-06-06 17:19:46 +02:00
Daniël de Kok 0d96468ebb marlin: support tp>1 when group_size==-1 2024-06-06 17:19:28 +02:00
Daniël de Kok 4594e6faba Add support for Marlin-quantized models
This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel, which provides good speedups decoding batches
of 16-32 tokens. It supports quantized models with symmetric
quantization, groupsize -1 or 128, and 4-bit.

Tested with:

- Llama 2
- Llama 3
- Phi 3
2024-06-06 13:16:52 +02:00
Daniël de Kok 3f4bcf978c
Fix GPTQWeight import (#2020)
# What does this PR do?

Fix stray import.

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-05 14:49:15 +02:00
Nicolas Patry 0a94fad79f
Fixing rocm. (#2021)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-05 14:41:34 +02:00
OlivierDehaene 8aece3bd68
feat: move allocation logic to rust (#1835)
Close #2007
2024-06-05 12:18:38 +02:00
Daniël de Kok 9ffe1f1e67
Do not initialize scratch space when there are no ExLlamaV2 layers (#2015)
# What does this PR do?

Do not attempt to allocate ExLlamaV2 scratch buffers when there are no
ExLlama2 layers. Avoids a crash in warmup for models that cannot use
exllama when ExLlamaV2 is installed.

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-05 10:45:47 +02:00
Nicolas Patry 824edf28d7
Hotfixing `make install`. (#2008)
# What does this PR do?

Fixes initial and subsequent installs (protection for folder creation
should only be for git commit, checking out correct commit should be on
both.

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-04 23:34:03 +02:00
Nicolas Patry 8390e251d9
Making `make install` work better by default. (#2004)
# What does this PR do?

Making `make install` a much better sane default to start local dev
environments.

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-04 19:38:46 +02:00
Daniël de Kok d14eaacaca
Support GPTQ models with column-packed up/gate tensor (#2006)
# What does this PR do?

The GPTQ code path for column-packed packed tensors assumed that this is
always a QKV matrix. However, models (e.g. Phi-3) can also have
column-packed MLP up/gate matrices.

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2024-06-04 19:37:49 +02:00
OlivierDehaene 757223b352
feat: add SchedulerV3 (#1996)
- Refactor code to allow supporting multiple versions of the
generate.proto at the same time
- Add v3/generate.proto (ISO to generate.proto for now but allow for
future changes without impacting v2 backends)
- Add Schedule trait to abstract queuing and batching mechanisms that
will be different in the future
- Add SchedulerV2/V3 impl
2024-06-04 15:56:56 +02:00