Commit Graph

1192 Commits

Author SHA1 Message Date
drbh 23bc38b10d
fix: include add_special_tokens in kserve request ()
Merging, as this patch is already in use and is fully limited to the kserve feature
2024-12-19 16:55:17 -05:00
Wang, Yi ab5f616920
change xpu lib download link ()
Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-12-19 12:18:58 +01:00
Mohit Sharma 8f66d323d0
Update vllm kernels for ROCM ()
* (vllm) updated vllm rocm kernels

* revert silu

* update partition size

* remove grouped_topk

* (nit) remove log

* update moe-kernels commit
2024-12-18 12:44:42 +01:00
janne-alatalo 7eeefa3b57
Qwen2-VL runtime error fix when prompted with multiple images ()
* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix the runtime error that occurred when the Qwen2-VL model was prompted
with more than one image. The runtime error was:

 File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 459, in get_position_ids
    text_pos_ids = torch.arange(text_length, device=d)
RuntimeError: upper bound and larger bound inconsistent with step sign

The error was caused by the text_length variable going negative when
multiple images caused multiple iterations of the main loop in the
get_position_ids function.

The bug was a simple logic mistake: next_image_pos is initialized as a
relative offset from current_pos, but it was used as if it were an
absolute position from zero.
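
A minimal runnable sketch of the corrected bookkeeping (the names
image_spans, next_image_pos, and image_len are hypothetical and only
mirror the commit message; the real get_position_ids works on token
tensors):

    import torch

    def text_segment_lengths(image_spans):
        """Toy model of the fix: next_image_pos is an offset RELATIVE to
        current_pos, so the text segment length is the offset itself."""
        current_pos = 0
        lengths = []
        for next_image_pos, image_len in image_spans:
            # Buggy version: text_length = next_image_pos - current_pos
            # After the first image current_pos > 0, so text_length went
            # negative and torch.arange raised the "step sign" RuntimeError.
            text_length = next_image_pos
            lengths.append(torch.arange(text_length))
            current_pos += next_image_pos + image_len
        return lengths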

* Fix runtime error when Qwen2-VL was prompted with multiple images

Fix the runtime error that occurred when the Qwen2-VL model was prompted
with more than one image. The runtime error was:

File "text-generation-inference/server/text_generation_server/models/custom_modeling/qwen2_vl.py", line 534, in forward
    inputs_embeds[input_ids == self.image_token_id] = image_embeds
RuntimeError: shape mismatch: value tensor of shape [512, 3584] cannot be broadcast to indexing result of shape [1024, 3584]

(The shape numbers in the error message can differ depending on the
input image resolutions.)

The error was caused by adding the wrong number of <|image_pad|> tokens
to the tokenized input in the image_text_replacement function.

The bug was a simple logic mistake: the number of image pad tokens was
taken from the length of the pixel_value_shape tensor's first dimension.
However, pixel_value_shape contains the patches from all of the images,
so the code added the total number of image pad tokens required for the
whole input at each image's location. This resulted in extra image pad
tokens being present in the tokenized input.

The fix is to compute the number of required tokens from the
image_grid_thw tensor, which contains the grid_t, grid_h, and grid_w
values for each image. grid_t * grid_h * grid_w gives the total number
of patches for the image [1], and the number of required image pad
tokens is number_of_patches // 4.

[1] 31f9a289a6/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py (L311)
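
A short runnable sketch of the corrected per-image count (toy grid
values; only the grid_t * grid_h * grid_w product and the // 4 division
come from the commit message):

    import torch

    # One (grid_t, grid_h, grid_w) row per image (toy values).
    image_grid_thw = torch.tensor([[1, 16, 16], [1, 32, 16]])

    for thw in image_grid_thw:
        num_patches = int(thw.prod())      # grid_t * grid_h * grid_w
        num_pad_tokens = num_patches // 4  # pads required for THIS image
        placeholder = "<|image_pad|>" * num_pad_tokens
        # ...insert `placeholder` at this image's location in the prompt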

---------

Co-authored-by: Janne Alatalo <janne.alatalo@jamk.fi>
2024-12-16 22:55:11 -05:00
drbh a72f339c79
fix: lint backend and doc files () 2024-12-16 16:12:34 -05:00
Nicolas Patry 11ab329883
Fixing CI. () 2024-12-16 10:58:15 +01:00
Nicolas Patry 6f0b8c947d
New arg. () 2024-12-16 10:34:50 +01:00
Hugo Larcher 1708865fdc
Feat/trtllm cancellation dev container ()
Add devcontainers for TRTLLM backend.

---------

Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
2024-12-13 16:19:06 +01:00
Funtowicz Morgan ea7f4082c4
TensorRT-LLM backend bump to latest version + misc fixes ()
* misc(cmake) update dependencies

* feat(hardware) enable new hardware.hpp and unittests

* test(ctest) enable address sanitizer

* feat(backend): initial rewrite of the backend for simplicity

* feat(backend): remove all the logs from hardware.hpp

* feat(backend): added some logging

* feat(backend): enable compiler warning if RVO is not applied

* feat(backend): missing return statement

* feat(backend): introduce backend_workspace_t to store precomputed information from the engine folder

* feat(backend): delete previous backend impl

* feat(backend): more impl

* feat(backend): use latest trtllm main version to have g++ >= 13 compatibility

* feat(backend): allow overriding which Python to use

* feat(backend): fix backend_exception_t -> backend_error_t naming

* feat(backend): impl missing generation_step_t as return value of pull_tokens

* feat(backend): make backend_workspace_t::engines_folder constexpr

* feat(backend): fix main.rs retrieving the tokenizer

* feat(backend): add guard to multiple header definitions

* test(backend): add more unittest

* feat(backend): remove constexpr from par

* feat(backend): remove constexpr

* test(backend): more test coverage

* chore(trtllm): update dependency towards 0.15.0

* effectively cancel the request on the executor

* feat(backend): fix moving backend when pulling

* feat(backend): make sure we can easily cancel request on the executor

* feat(backend): fix missing "0" field access

* misc(backend): fix reborrowing Pin<&mut T> as described in the doc https://doc.rust-lang.org/stable/std/pin/struct.Pin.html#method.as_mut

* chore: Add doc and CI for TRTLLM ()

* doc: Formatting

* misc(backend): indent

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
2024-12-13 15:50:59 +01:00
Nicolas Patry 3bb3fd19ae
Fixup opt to reduce the number of odd if statements. ()
* Fixup opt to reduce the number of odd if statements.

* Fixing cargo lock
2024-12-12 18:20:13 +01:00
Wang, Yi bf59118a93
fix issue where facebook/opt-125m was not working ()
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-12 14:41:30 +01:00
Nicolas Patry c3bd7212c2
Fixing latest flavor by disabling it. () 2024-12-12 14:09:35 +01:00
Guspan Tanadi f01f2fb6e7
docs(README): supported hardware links TGI AMD GPUs () 2024-12-12 13:49:33 +01:00
Nicolas Patry 07b01293c5
Prepare patch release. () 2024-12-11 21:03:50 +01:00
RodriMora cc66dccbe8
Update README.md ()
Added instructions to clone the repo and change directory into it.

In the following steps there is a "make install" step that would fail if
people have not cloned the repo and cd'd into it, so it may be confusing
for some.

Also added a Python venv alternative to conda.
2024-12-11 19:45:49 +01:00
Nicolas Patry 82c24f7420
Using both values from the config as they might not be correct. ()
* Using both values from the config as they might not be correct.

* Fixing max_position_embeddings for falcon.

* Simple attempt to fix the healthcheck block allocation.

* Much simpler solution.

* Default value for Backend start_health
2024-12-10 19:37:09 +01:00
Nicolas Patry a2d878fa0f
Small update to docs () 2024-12-10 10:46:26 +01:00
Nicolas Patry b2fac5d947
Hotfix link2 ()
2nd hotfix?
2024-12-09 20:57:18 +01:00
Nicolas Patry a70dd2998b
Hotfixing the link. () 2024-12-09 20:50:07 +01:00
Nicolas Patry 042791fbd5
Prep new version ()
* New version.

* Link fixup.

* Update docs.

* Fixup.
2024-12-09 20:42:42 +01:00
Nicolas Patry 27fa83ca5b
V3 doc ()
* V3 document.

* Updating asset.
2024-12-09 19:58:07 +01:00
Nicolas Patry a04356fb8c
Attempt at cleverer auto batch_prefill values (some simplifications). ()
* Attempt at cleverer auto batch_prefill values (some simplifications).

* Less flaky tests.

* Fixing typo insertion.

* Update launcher/src/main.rs

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Adding small comment for source of calculation.

* Adding L40.

* Adding L40s.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-12-09 19:44:32 +01:00
drbh 9f5c9a5e22
Enable paligemma2 ()
* feat: support loading gemma2 as vlm text model

* feat: add test for paligemma2
2024-12-06 14:41:49 -05:00
Nicolas Patry 08f6fa0b59
Removing the experimental status from prefill chunking. 2024-12-06 19:09:40 +01:00
Nicolas Patry d96dcb1797
Adding A100 compute. () 2024-12-06 18:19:15 +01:00
Nicolas Patry 5df8059037
Auto max prefill ()
* Attempt at automatic max batch prefill.

* Taking into account number of shards.

* Adding more cards.

* Adding A100 + H100

* Adding a few more cards.

* Logprobs cost too much.

* h100 better name, and keep factor of 2

* Damn inflated sparse tflops.

* Typo in h100.

* Updated the flops calculation (checked with fvcore).

* chunking by default.

* Fix prefix caching for chat completion since we removed logprobs.

* More tests.

* Dropping all the prefill logprobs.

* Add a flag that enables users to get logprobs back.

* Repairing prompt token counting.

* Fixing a few tests.

* Remove some scaffolding.

* Attempting to reduce the issues (workarounds for now).
2024-12-06 05:52:00 +01:00
OlivierDehaene 8c3669b287
feat: auto max_new_tokens ()
* feat: auto max_new_tokens

* update default

* Fixing the tests.
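
A sketch of the semantics the title suggests, as an assumption rather
than the actual implementation: when a request omits max_new_tokens,
generate up to the remaining context window instead of a fixed default.

    def resolve_max_new_tokens(max_new_tokens, input_length, max_total_tokens):
        # Hypothetical helper: honor an explicit max_new_tokens; otherwise
        # allow generation to fill the rest of the context window.
        if max_new_tokens is not None:
            return max_new_tokens
        return max(max_total_tokens - input_length, 1)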

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-06 05:50:35 +01:00
Wang, Yi 6685e8fcda
use oneapi 2024 docker image directly for xpu ()
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-12-06 09:36:23 +05:30
drbh e0db633396
fix: avoid setting use_sgmv if no kernels present () 2024-12-04 15:26:09 -05:00
Nicolas Patry b57f370386
Saving some VRAM. ()
* Saving some VRAM.

- 8B on 4xL4, attention=flashdecoding. Before: 4.28GB left; after:
  4.32GB left, so roughly 40MB saved per GPU.

- The effect is not as visible on attention=flashinfer and n_shard=1. I
  suspect it is linked to the torch allocator.

* Adding assertion.
2024-12-03 04:04:21 +01:00
Daniël de Kok 2003d8be0c
Sync (most) server dependencies with Nix ()
* Sync (most) server dependencies with Nix

Skipped most grpcio packages because of a protobuf version
incompatibility with the opentelemetry packages.

* Add a primitive script to generate Poetry commands to sync with Nix

This is not fully automated, since resolving the Nix versions may fail.
However, it does take most of the work out of doing this manually.

* Upgrade eetq?

* Fmt.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-12-03 04:04:06 +01:00
Dmitry Rogozhkin 535149d872
fix: only use eos_token_id as pad_token_id if int ()
Llama 3 has a list of values as eos_token_id:
  "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"
This breaks the tokenizer, since it expects a single value. This commit
uses tokenizer.eos_token_id instead in such cases.
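
A minimal sketch of the guard described above (resolve_pad_token_id is
a hypothetical helper; only the int check reflects the commit):

    def resolve_pad_token_id(config_eos_token_id, tokenizer_eos_token_id):
        # Llama 3 style configs carry a list in eos_token_id; only reuse
        # it as pad_token_id when it is a plain int, otherwise fall back
        # to the tokenizer's single eos id.
        if isinstance(config_eos_token_id, int):
            return config_eos_token_id
        return tokenizer_eos_token_id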

Fixes: 

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-12-02 06:26:37 +01:00
drbh 2c74c55637
fix: add merge-lora arg for model id () 2024-12-02 05:52:02 +01:00
Torsten Raudssus a35d1e6fe5
Removing ../ that broke the link () 2024-12-02 05:48:55 +01:00
Nicolas Patry 1d2cb356b9
Fix doc. () 2024-12-02 05:28:26 +01:00
drbh d471805134
Support continue final message ()
* feat: support continue_final_message param in chat request

* feat: add test for continue final message

* fix: bump openapi docs

* fix: remove continue_final_message chat request param

* fix: remove unneeded launcher args in continue test

* fix: bump test output

* fix: remove accidentally included guideline from rebase

* fix: remove guideline tests

* fix: adjust continuation tests expected text

* fix: replace expected output for continue test
2024-11-27 19:13:30 -05:00
jp caff779dd4
Fix: docs typo ()
Fix: typo in model loading code
2024-11-26 14:28:58 +01:00
Wang, Yi 892a26e549
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention) ()

Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-11-26 14:28:11 +01:00
Daniël de Kok 72ab60fdd5
Use FP8 KV cache when specified by compressed-tensors ()
The compressed-tensors configuration can specify the configuration of
the KV cache as well. Use an FP8 KV cache when the configuration tells
us to do so (all other options and types are ignored for now).
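
A rough sketch of the selection logic (the kv_cache_scheme keys here
are assumptions; the real compressed-tensors schema and the TGI wiring
are more involved):

    import torch

    def kv_cache_dtype(quant_config: dict, default=torch.float16):
        # Hypothetical reading of a compressed-tensors style config: use
        # an FP8 KV cache only when an 8-bit float scheme is requested.
        kv = quant_config.get("kv_cache_scheme")
        if kv and kv.get("type") == "float" and kv.get("num_bits") == 8:
            return torch.float8_e4m3fn
        return default  # all other options and types are ignored for now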
2024-11-26 08:27:41 +01:00
Daniël de Kok 289aa48554
Move JSON grammar -> regex grammar conversion to the router ()
* Move JSON grammar -> regex grammar conversion to the router

This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:

simple schema           time:   [5.8293 µs 5.8307 µs 5.8320 µs]
                        change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                        Performance has improved.

complex schema          time:   [14.875 µs 14.881 µs 14.887 µs]
                        change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                        Performance has improved.

Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
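
For illustration, the equivalent conversion on the Python side
(assuming the outlines package's build_regex_from_schema helper, whose
import path varies across versions; the router itself now does this via
the outlines-core Rust crate):

    import json

    from outlines.fsm.json_schema import build_regex_from_schema

    schema = json.dumps({
        "type": "object",
        "properties": {"name": {"type": "string"}},
        "required": ["name"],
    })

    # Produces a regular expression that constrains decoding to valid
    # instances of the schema.
    regex = build_regex_from_schema(schema)
    print(regex[:80])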
2024-11-25 18:47:34 +01:00
drbh c637d68d74
feat: concat the adapter id to the model id in chat response ()
* feat: concat the adapter id to the model id in chat response

* fix: updated to include only the adapter id in chat response
2024-11-25 12:36:31 -05:00
OlivierDehaene 780531ec77
chore: prepare 2.4.1 release ()
* chore: prepare 2.4.1 release

* fix tests

* fmt
2024-11-22 17:26:15 +00:00
Daniël de Kok e87893d38e
chore: Update to marlin-kernels 0.3.6 ()
This fixes a bug in 2:4 Marlin:
https://github.com/vllm-project/vllm/pull/10464
2024-11-22 14:44:47 +00:00
OlivierDehaene ab7ccf5bc3
feat: add payload limit ()
* feat: add payload limit

* update launcher
2024-11-21 18:20:15 +00:00
Hugo Larcher d5bc6a20bd
feat: Add automatic nightly benchmarks ()
* feat: Add automatic nightly benchmarks

* fix: Update runners group

* fix: add created_at field to results

* fix: Add variable results file location
2024-11-21 17:11:42 +00:00
Lucain d012f229c6
Remove guideline from API () 2024-11-21 16:56:38 +00:00
Daniël de Kok c5b5b3a11c
docs: Add a README section about using Nix () 2024-11-21 16:53:27 +00:00
drbh faa10ad0bc
fix: tweak grammar test response () 2024-11-21 16:46:00 +00:00
OlivierDehaene 8e0c161d0a
fix: incomplete generations with single-token generations and models that did not support chunking ()
* Incomplete generation stream fix ()

entries.len() can be larger than batch.size during prefill, so we need
to filter there as well.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* entries was wrongly extended for models that did not support chunking
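
A tiny sketch of the filtering idea (a hypothetical Python helper; the
actual fix lives in the Rust batching code):

    def filter_entries(entries: dict, batch_request_ids: set) -> dict:
        # Keep only the entries whose request ids actually made it into
        # the prefill batch; entries can outnumber batch.size.
        return {rid: e for rid, e in entries.items() if rid in batch_request_ids}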

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
2024-11-21 16:37:55 +00:00
Daniël de Kok 3c54488638
nix: downgrade to outlines 0.1.3 () 2024-11-21 13:00:26 +01:00