1079 Commits

Author SHA1 Message Date
Nicolas Patry b4654a36dc
Fixing up the tests ? 2024-09-16 17:01:51 +02:00
Nicolas Patry 5adece6313
This doesn't seem needed. 2024-09-16 17:01:51 +02:00
Nicolas Patry dd4b774e0d
New cargo lock 2024-09-16 17:01:44 +02:00
Nicolas Patry b7cb8d5145
Let's figure out the issue... 2024-09-16 17:01:30 +02:00
Nicolas Patry 3d7b81535a
Only link cuda driver libraries. 2024-09-16 17:01:30 +02:00
Nicolas Patry e898483db6
Updating outlines to 0.0.46 2024-09-16 17:01:30 +02:00
Nicolas Patry ce3efc83ed
Remove tmate. 2024-09-16 17:01:30 +02:00
Nicolas Patry 7f58f7dc61
Symlink all the things. 2024-09-16 17:01:29 +02:00
Nicolas Patry 42107de71f
Let's try to find libnvidia-ml 2024-09-16 17:01:29 +02:00
Nicolas Patry edaa7f847d
Does this work ? 2024-09-16 17:01:29 +02:00
Nicolas Patry d1e79ddae0
Fix override. 2024-09-16 17:01:29 +02:00
Nicolas Patry db054b95df
Check the paths. 2024-09-16 17:01:29 +02:00
Nicolas Patry afcd047a58
Yaml yaml. 2024-09-16 17:01:29 +02:00
Nicolas Patry 60db294f9a
Link cuda to nix ? 2024-09-16 17:01:28 +02:00
Nicolas Patry 8e7c7c61f1
Let's see what the issue is ? 2024-09-16 17:01:28 +02:00
Nicolas Patry 815449da74
Removing unused code. 2024-09-16 17:01:28 +02:00
Nicolas Patry c227345878
Run on actual GPUs. 2024-09-16 17:01:28 +02:00
Nicolas Patry 3d73c99ebe
Attempt at integration tests. 2024-09-16 17:01:28 +02:00
Nicolas Patry f47cdc1fe1
Attempting rapidly the integration tests. 2024-09-16 17:01:26 +02:00
Nicolas Patry 38fcafcf96
Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-16 17:00:54 +02:00
Daniël de Kok 7774655297
Add tests for Mixtral (#2520)
Disable by default because CI runners do not have enough GPUs.
2024-09-16 12:39:18 +02:00
Alex Strick van Linschoten 9cca3e0b03
Use `ratatui` not (deprecated) `tui` (#2521)
* use ratatui not archived tui

* bump ratatui all the way with options
2024-09-13 18:45:28 +02:00
Wang, Yi 3ac7df2b6d
hotfix : enable intel ipex cpu and xpu in python3.11 (#2517)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-12 17:23:49 +02:00
drbh 628334d336
fix: pass missing revision arg for lora adapter when loading multiple… (#2510)
fix: pass missing revision arg for lora adapter when loading multiple adapters
2024-09-12 17:04:52 +02:00
Nicolas Patry d95c670ada
Add nix test. (#2513)
* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try this.

* Our runner + pure test (not written)

* Remove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.
2024-09-12 14:54:56 +02:00
Daniël de Kok 94304649f1
nix: support Python tokenizer conversion in the router (#2515)
Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
2024-09-12 10:44:01 +02:00
Nicolas Patry 69e3be20fb
Fix truffle (#2514)
* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.
2024-09-11 22:45:19 +02:00
Nicolas Patry dae3bf1d87
Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-11 22:41:56 +02:00
Nicolas Patry a4e3e8c608
Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-11 18:10:40 +02:00
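
The radix tree note in the entry above (slices used in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids) can be illustrated with a minimal sketch. This is not the code from radix.rs: the helper `shared_prefix_len`, the `u32` token type, and the sample values are assumptions made for illustration only.

```rust
use std::sync::Arc;

/// Hypothetical helper (not from radix.rs): length of the common prefix
/// between a stored key and a query, computed on borrowed slices.
fn shared_prefix_len(key: &[u32], query: &[u32]) -> usize {
    key.iter().zip(query.iter()).take_while(|(a, b)| a == b).count()
}

fn main() {
    // Token ids are allocated once and shared through an `Arc`.
    let input_ids: Arc<Vec<u32>> = Arc::new(vec![1, 2, 3, 4, 5]);

    // Cloning the `Arc` only bumps a reference count; the ids are not copied.
    let for_request = Arc::clone(&input_ids);
    assert_eq!(Arc::strong_count(&input_ids), 2);

    // Borrowing sub-slices walks the sequence without allocating new vectors.
    let stored_key: &[u32] = &[1, 2, 3, 9];
    let matched = shared_prefix_len(stored_key, &for_request[..]);
    let remainder: &[u32] = &for_request[matched..];

    println!("matched={matched} remainder={remainder:?}");
}
```

Passing `&[u32]` slices down a lookup keeps each step to a pointer and a length, while the owning `Arc` can still be cloned cheaply wherever ownership is needed.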
Vallepu Vamsi Krishna eabbbbda23
Add Directory Check to Prevent Redundant Cloning in Build Process (#2486)
Update Makefile-fbgemm

Added Directory check for FBGEMM repository cloning.
2024-09-07 13:19:43 +02:00
Nicolas Patry c1fe28d694
Fixing more correctly the invalid drop of the batch. (#2498) 2024-09-06 17:35:49 +02:00
Martin Iglesias Goyanes aaea212d0f
Add links to Adyen blogpost (#2500)
* Add links to Adyen blogpost

* Adding to toctree.

* Update external.md

* Update _toctree.yml

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-06 17:00:54 +02:00
Daniël de Kok a3c9c62dc0
hotfix: add syrupy to the right subproject (#2499) 2024-09-06 12:47:06 +02:00
Daniël de Kok 379472c4c2
radix trie: add assertions (#2491)
These should all be cheap assertions.

Also:

* Fixup some comments.
* Delete a `remove` that was done unnecessarily twice.
2024-09-06 11:55:23 +02:00
Daniël de Kok 2eb57a15ec
Fix incompatibility with latest `syrupy` and update in Poetry (#2497) 2024-09-06 11:00:52 +02:00
Daniël de Kok 0424e27f65
nix: add pyright/ruff for proper LSP in the impure devshell (#2496)
We need this to ensure that pyright/ruff are part of the same
interpreter/venv.
2024-09-06 10:19:04 +02:00
Wang, Yi 5cd8025f18
hotfix: fix regression of attention api change in intel platform (#2439)
fix regression caused by attention api change. ipex.varlen_attention does not support paged-cache
format kv input now.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-05 17:41:39 +02:00
Daniël de Kok e279b38aca
Add two handy gitignores for Nix environments (#2484) 2024-09-05 17:06:54 +02:00
Nicolas Patry 8b96a18265
Adding links to Adyen blogpost. (#2492) 2024-09-05 16:11:52 +02:00
Daniël de Kok deec30f893
hotfix: avoid non-prefilled block use when using prefix caching (#2489)
The minimum batch size logic could cause prefix blocks to be
deallocated without prefill. The next allocation of the same
prefix would then use garbage blocks.
2024-09-05 15:09:29 +02:00
drbh 6cb42f49ae
feat: support lora revisions and qkv_proj weights (#2482)
* feat: support lora revisions and qkv_proj weights

* fix: add qkv_proj weights to weight test
2024-09-02 13:09:06 -04:00
drbh 47d7e34458
fix: enable chat requests in vertex endpoint (#2481)
* fix: enable chat requests in vertex endpoint

* feat: avoid unwrap and pre allocate future vec
2024-09-02 10:00:52 -04:00
Daniël de Kok de2cdeca53
nix: add punica-kernels (#2477)
Enables LoRA support.
2024-09-02 11:31:36 +02:00
Daniël de Kok e4ab855480
nix: improve impure devshell (#2478)
- Add some test dependencies.
- Install server in venv.
- Install Python client in venv.
2024-09-02 09:27:10 +02:00
Nicolas Patry d9fbbaafb0
Tied embeddings in MLP speculator. (#2473)
* Tied embeddings in MLP speculator.

* Fixing the scale_weight when users decide to not use the speculation as
much as defined in the config.

* Adding scaling support + optimize some ops.
2024-08-29 17:44:54 +02:00
Wang, Yi 9883f3b40e
update doc with intel cpu part (#2420)
* update doc with intel cpu part

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

We never use `latest` in documentation; it causes too many issues for users. Release numbers get updated on every release.

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-29 17:42:02 +02:00
drbh d5202c46f7
feat: add /v1/models endpoint (#2433)
* feat: add /v1/models endpoint

* feat: add /v1/models endpoint

* fix: remove unused type import

* fix: revert route typo

* fix: update docs with new endpoint

* fix: add to redocly ignore and lint
2024-08-29 16:32:38 +02:00
Nicolas Patry e415b690a6
Lots of improvements (Still 2 allocators) (#2449)
* Making prefix/flashinfer the default and testing the full release tests.

* Include flashinfer in the docker.

* Using prebuilt.

* Allowing window_left_size (dummy version).

* Disabling flashinfer/prefix caching on odd head_dim

* Disable prefix caching for lora.

* More specific codes.

* Update lock

* Updating integration tests with new values with FI/FD.

Remove paged as a default too, and using FD everywhere.

* Update cargo lock ?

* Upgrade to 1.80 because of bitstream...

* Everywhere 1.80

* Forgot last default place.

* Apply suggestions from code review

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Updated flake lock

* Tmp

* Upgrade resolution system for fewer errors in resolution.

* Remove lambda for cleaner function.

* Handling debugger.

* Override the env in server tests.

* Is this enough to make it work ?

* This seems to be working.

* Downgrade some logs.

* Fixing the default for vlm.

* Don't enable prefix caching on VLM just yet.

* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)

* Fixing prefix caching for flashdecoding.

* Update all models.

* Fixed flashinfer version.

* add_special_tokens is internal only

* Fixing seqlen with the new vlms.

* Fixing the issue with `add_special_tokens` not being passed around.

* Fixing the test.

* Removing encoder_decoder (seq2seq).

* Update the chat test.

* Fixing the batching tokenization in flash causal lm.

* Truncating left for radix purposes.

* Oops this doesn't belong here.

* Put back default pure shell.

* Update server tests

- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room

* Only n_heads / process_group.size() are necessary.

* Revert the integration tests change (seems linked to head_size
modification).

* Adding error message when assert is violated.

* Fixing the free algorithm to handle times where the common prefix is
smaller.

* Apply suggestions from code review

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Update server/text_generation_server/layers/attention/common.py

Co-authored-by: OlivierDehaene <olivier@huggingface.co>

* Fix disabling prefix caching - Fix windowing checks.

* Revert the Cohere tokenizer change (for now using a revision instead).

* Fmt.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2024-08-29 16:29:01 +02:00
Daniël de Kok 4e821c003a
nix: build Torch against MKL and various other improvements (#2469)
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids
  compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
2024-08-29 16:25:25 +02:00
drbh 8f99f165ce
fix: improve regex expression (#2468) 2024-08-28 13:44:44 -04:00