Commit Graph

802 Commits

Author SHA1 Message Date
Mohit Sharma 0a5b19a3ed updated doc 2024-07-02 13:10:26 +00:00
Mohit Sharma 6d6b0bdcc4 fix formatting 2024-07-02 13:08:56 +00:00
Mohit Sharma f34560f74a updated docs 2024-07-02 12:50:39 +00:00
Mohit Sharma bf4db77103 updated doc 2024-06-25 16:15:03 +00:00
Mohit Sharma 5e38d3534c update launcher 2024-06-25 15:45:04 +00:00
Mohit Sharma 15b351b4a9 updated doc 2024-06-25 15:35:49 +00:00
Mohit Sharma 1e6e7db02e add AMMO example 2024-06-25 14:58:45 +00:00
Mohit Sharma a7909e6f94 add torch dtype 2024-06-25 14:12:29 +00:00
Mohit Sharma f4714a8f98 remove example 2024-06-25 07:08:37 +00:00
Mohit Sharma 3ae62304ab revert makefile 2024-06-24 15:22:10 +00:00
Mohit Sharma 034686b178 update heading 2024-06-24 15:20:45 +00:00
Mohit Sharma e81c4cf863 update launcher 2024-06-24 15:09:17 +00:00
Mohit Sharma 3cc2f4e9fa update doc 2024-06-24 14:50:16 +00:00
Mohit Sharma 001ec09df3 rename doc 2024-06-24 14:38:30 +00:00
Mohit Sharma 50806ffe4a update port 2024-06-24 14:37:29 +00:00
Mohit Sharma 557e18e08c fix style 2024-06-24 14:30:26 +00:00
Mohit Sharma 8a0bb53ef3 add docs 2024-06-24 11:09:17 +00:00
Mohit Sharma f0d95b0f4b fixrs 2024-06-24 11:07:32 +00:00
Mohit Sharma fb83e3416b fix 2024-06-24 08:25:59 +00:00
Mohit Sharma 81fd601c44 rebase and update 2024-06-24 08:15:36 +00:00
Mohit Sharma 084de9907c Merge branch 'main' into fp8_kvcache 2024-06-24 07:53:33 +00:00
Daniël de Kok bcb3faa1c2
Factor out sharding of packed tensors (#2059)
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
2024-06-20 09:56:04 +02:00
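A minimal sketch of what sharding a packed tensor involves, assuming equally sized blocks packed along one dimension. The name `shard_packed` and the list-based "tensor" are hypothetical illustrations, not the actual `Weights.get_packed_sharded` implementation.

```python
from typing import List


def shard_packed(
    packed: List[int], num_blocks: int, rank: int, world_size: int
) -> List[int]:
    """Return the part of a packed 1-D tensor owned by `rank`.

    The tensor is assumed to hold `num_blocks` equally sized blocks
    (for example Q, K and V) laid out one after the other. Each rank
    takes its own contiguous slice of every block and concatenates them.
    """
    if len(packed) % num_blocks != 0:
        raise ValueError("packed size must be divisible by num_blocks")
    block_size = len(packed) // num_blocks
    if block_size % world_size != 0:
        raise ValueError("block size must be divisible by world_size")
    shard_size = block_size // world_size

    shard: List[int] = []
    for block in range(num_blocks):
        start = block * block_size + rank * shard_size
        shard.extend(packed[start : start + shard_size])
    return shard


# Packed bias with Q = 0..3, K = 10..13, V = 20..23, split over two ranks.
packed_bias = [0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23]
rank0 = shard_packed(packed_bias, num_blocks=3, rank=0, world_size=2)
rank1 = shard_packed(packed_bias, num_blocks=3, rank=1, world_size=2)
```

Slicing the packed tensor naively in two halves would give rank 0 all of Q and half of K, which is why each block has to be sharded separately.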
Daniël de Kok f5a9837592
Support exl2-quantized Qwen2 models (#2085)
Fixes #2081.
2024-06-20 07:56:16 +02:00
drbh cdbf802860
feat: rotate tests ci token (#2091) 2024-06-19 17:02:58 -04:00
Daniël de Kok 11ea9ce002
CI: pass pre-commit hooks again (#2084) 2024-06-18 09:38:21 +02:00
Guillaume LEGENDRE 4f25c67d63
CI: Tailscale improvements (#2079)
* test local tailscale

* Update build.yaml

* Update build.yaml

* Update build.yaml

* Update build.yaml

* wait for ssh

* network host

* change step order
2024-06-18 09:13:04 +02:00
Daniël de Kok c8c7ccd31e
Set maximum grpc message receive size to 2GiB (#2075)
* Set maximum grpc message receive size to 2GiB

The previous default was 4MiB, which doesn't really work well for
multi-modal models.

* Update to Rust 1.79.0

* Fixup formatting to make PR pass
2024-06-17 16:40:44 +02:00
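To put the two limits side by side, here is a small sketch. The option key `grpc.max_receive_message_length` is the standard gRPC channel argument; the exact value `(1 << 31) - 1`, the helper `fits`, and the image payload are assumptions made for illustration, on the premise that the limit has to fit in a signed 32-bit integer.

```python
# Sizes mentioned in the commit message.
OLD_DEFAULT_BYTES = 4 * 1024 * 1024  # 4 MiB
# Assumption: the limit is a signed 32-bit integer, so the largest usable
# value is one byte short of 2 GiB.
NEW_LIMIT_BYTES = (1 << 31) - 1

# Option in the (key, value) form accepted by the Python grpc package.
GRPC_OPTIONS = [("grpc.max_receive_message_length", NEW_LIMIT_BYTES)]


def fits(message_bytes: int, limit: int) -> bool:
    """Return True if a message of `message_bytes` is within `limit`."""
    return message_bytes <= limit


# Hypothetical payload: one 2048x2048 RGB image stored as float32.
image_payload = 2048 * 2048 * 3 * 4
fits_old = fits(image_payload, OLD_DEFAULT_BYTES)
fits_new = fits(image_payload, NEW_LIMIT_BYTES)
```

With the `grpc` package installed, a list like `GRPC_OPTIONS` would be passed as the `options` argument of `grpc.server(...)` or `grpc.aio.server(...)`.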
Ziru Niu 0f7d38e774
fix build.rs watch files (#2072) 2024-06-17 12:10:01 +02:00
Lysandre Debut 131838919e
Contributing guide & Code of Conduct (#2074)
* Contributing guide & Code of Conduct

* Redirect to GitHub's tutorial on PRs
2024-06-17 12:09:31 +02:00
Daniël de Kok e903770897
Support different image sizes in prefill in VLMs (#2065)
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-06-17 10:49:41 +02:00
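The idea of processing the batch together can be sketched as follows. This is an illustration only: it assumes compatibility is reached by padding to the largest height and width, whereas a real image processor may resize or tile instead. `pad_batch` and the nested-list images are hypothetical.

```python
from typing import List

Image = List[List[int]]  # a single-channel image as rows of pixel values


def pad_batch(images: List[Image], fill: int = 0) -> List[Image]:
    """Pad every image to the largest height and width in the batch."""
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    padded = []
    for img in images:
        # Extend each row to the common width, then add missing rows.
        rows = [row + [fill] * (max_w - len(row)) for row in img]
        rows += [[fill] * max_w for _ in range(max_h - len(rows))]
        padded.append(rows)
    return padded


# A 2x2 image and a 1x3 image end up with the same 2x3 shape.
batch = pad_batch([[[1, 2], [3, 4]], [[5, 6, 7]]])
shapes = [(len(img), len(img[0])) for img in batch]
```

Because the target shape is computed from the whole batch, the resulting images can be stacked into a single tensor, which is not possible when each image is processed in isolation.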
Alvaro Moran 445f313504
Adding architecture document (#2044)
* doc: adding architecture document

* doc: add architecture to toctree

* fix: avoid cargo lock changes

* fix: avoid cargo lock tweak

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-06-14 09:28:34 -04:00
Tiezhen WANG 96b7b40ca3
Update the link for qwen2 (#2068)
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-06-14 11:59:33 +02:00
Daniël de Kok 093a27c528
Add support for GPTQ Marlin (#2052)
Add support for GPTQ Marlin kernels

GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:

- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.

The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
2024-06-14 09:45:42 +02:00
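The configurations listed above can be expressed as a simple check. This is a sketch that covers only the three listed fields; `GPTQConfig` and `can_use_gptq_marlin` are hypothetical names, and a real selection routine may also consider other factors such as hardware.

```python
from dataclasses import dataclass

# Values taken from the list in the commit message.
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


@dataclass(frozen=True)
class GPTQConfig:
    bits: int
    groupsize: int
    desc_act: bool


def can_use_gptq_marlin(config: GPTQConfig) -> bool:
    """Check a GPTQ configuration against the listed supported values."""
    # desc_act may be true or false, so it does not restrict support.
    return (
        config.bits in SUPPORTED_BITS
        and config.groupsize in SUPPORTED_GROUPSIZES
    )


ok = can_use_gptq_marlin(GPTQConfig(bits=4, groupsize=128, desc_act=True))
not_ok = can_use_gptq_marlin(GPTQConfig(bits=3, groupsize=128, desc_act=False))
```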
drbh f433f1f770
implement Open Inference Protocol endpoints (#1942)
* feat: add kserve feature and basic routes

* feat: implement infer endpoint wrapper around generate

* fix: refactor and improve types

* fix: improve infer and simplify

* fix: cleanup and improve api docs

* fix: refactor and encapsulate kserve feat in file

* fix: remove typos after rebase
2024-06-13 12:51:51 -04:00
drbh 42aa8ee1bb
PR #2049 CI run (#2054)
* Use minijinja's pycompat mode for python methods

* fix: cargo fmt lint for pre commit

---------

Co-authored-by: Armin Ronacher <armin.ronacher@active-4.com>
2024-06-13 11:53:49 -04:00
OlivierDehaene 90184df79c
fix(layers): fix SuRotaryEmbedding (#2060)
* fix(layers): fix SuRotaryEmbedding

* change arange

* remove logs
2024-06-12 18:24:47 +02:00
OlivierDehaene 521de6cacd
fix(server): fix OPT implementation (#2061) 2024-06-12 18:22:20 +02:00
drbh 376a0b7ada
Support chat response format (#2046)
* feat: support response_format in chat

* fix: adjust typos

* fix: add trufflehog lint
2024-06-11 10:44:56 -04:00
fxmarty a6e4d63c86
Update LLMM1 bound (#2050)
update commit
2024-06-11 19:30:29 +08:00
Luc Georges dfca1dfc5e
fix(ci): remove unnecessary permissions (#2045) 2024-06-10 12:16:53 -04:00
Luc Georges 4e74ec09a8
feat(ci): add trufflehog secrets detection (#2038) 2024-06-10 11:54:13 -04:00
Daniël de Kok 85dfc39222
Add Phi-3 medium support (#2039)
Add support for Phi-3-medium

The main difference between the medium and mini models is that medium
uses grouped query attention with a packed QKV matrix. This change adds
support for GQA with packed matrices to `Weights.get_weights_col_packed`
and uses it for Phi-3. This also allows us to remove the custom
implementation of GQA from dbrx attention loading.
2024-06-10 09:22:29 +02:00
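With grouped query attention the Q block of a packed QKV matrix is larger than the K and V blocks, so the blocks cannot be sharded as equal thirds. A minimal sketch of the row ranges each rank would read, assuming the layout is Q, then K, then V along the row dimension; the function name and the example sizes are hypothetical.

```python
from typing import List, Tuple


def packed_qkv_row_ranges(
    num_heads: int, num_kv_heads: int, head_size: int, rank: int, world_size: int
) -> List[Tuple[int, int]]:
    """Return the (start, stop) row ranges of Q, K and V owned by `rank`."""
    if num_heads % world_size or num_kv_heads % world_size:
        raise ValueError("head counts must be divisible by world_size")
    q_total = num_heads * head_size
    kv_total = num_kv_heads * head_size

    ranges = []
    offset = 0  # first row of the current block in the packed matrix
    for total in (q_total, kv_total, kv_total):
        shard = total // world_size
        start = offset + rank * shard
        ranges.append((start, start + shard))
        offset += total
    return ranges


# Hypothetical sizes: 8 query heads, 2 key/value heads, head size 4.
rank0 = packed_qkv_row_ranges(8, 2, 4, rank=0, world_size=2)
rank1 = packed_qkv_row_ranges(8, 2, 4, rank=1, world_size=2)
```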
fxmarty 9b3674d903
ROCm and sliding windows fixes (#2033)
* update vllm commit & fix models using sliding window

* update

* update commit

* fix bug where tunableop is bound to cuda graphs even when cuda graphs are disabled

* enable tunableop by default

* fix sliding window

* address review

* dead code

* precise comment

* is it flaky?
2024-06-10 15:09:50 +08:00
Daniël de Kok bf3c813782 server: use chunked inputs
The router now sends the input as chunks in addition to a single
string. This change modifies the server to process chunked input
rather than strings. This also allows us to remove the image
extraction code from the server.
2024-06-07 08:09:04 +02:00
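A rough sketch of what chunked input could look like. The chunk types below and the assumption that images are embedded in the prompt with Markdown syntax (`![](url)`) are illustrative and not taken from the router or the protobuf definitions.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextChunk:
    text: str


@dataclass
class ImageChunk:
    url: str


Chunk = Union[TextChunk, ImageChunk]

# Assumption: images appear in the prompt as Markdown images, ![](url).
IMAGE_PATTERN = re.compile(r"!\[\]\(([^)]+)\)")


def to_chunks(prompt: str) -> List[Chunk]:
    """Split a prompt into alternating text and image chunks."""
    chunks: List[Chunk] = []
    pos = 0
    for match in IMAGE_PATTERN.finditer(prompt):
        if match.start() > pos:
            chunks.append(TextChunk(prompt[pos : match.start()]))
        chunks.append(ImageChunk(match.group(1)))
        pos = match.end()
    if pos < len(prompt):
        chunks.append(TextChunk(prompt[pos:]))
    return chunks


chunks = to_chunks("Describe ![](https://example.com/cat.png) briefly.")
```

When the splitting happens once on the sending side, the receiver can iterate over typed chunks and no longer needs its own image extraction logic.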
Wang, Yi 4dabddb7ea
Xpu gqa (#2013)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-06-06 19:12:57 +02:00
Nicolas Patry 9765658212 Revert "Enabling CI for AMD with new runner.."
This reverts commit 101ac9a760.
2024-06-06 19:08:16 +02:00
Nicolas Patry 101ac9a760 Enabling CI for AMD with new runner.. 2024-06-06 19:07:48 +02:00
Nicolas Patry ed1cfde0d8
Internal runner ? (#2023)
2024-06-06 18:51:42 +02:00
Daniël de Kok 51621439a4 marlin: improve build 2024-06-06 17:19:46 +02:00
Daniël de Kok 0d96468ebb marlin: support tp>1 when group_size==-1 2024-06-06 17:19:28 +02:00