Nicolas Patry
3d46783f1a
Everywhere 1.80
2024-08-21 09:11:05 +02:00
Nicolas Patry
e2319fa891
Upgrade to 1.80 because of bitstream...
2024-08-21 09:11:05 +02:00
Nicolas Patry
f628886c0a
Update cargo lock ?
2024-08-21 09:11:05 +02:00
Nicolas Patry
2fe5879816
Updating integration tests with new values with FI/FD.
...
Remove paged as a default too, and using FD everywhere.
2024-08-21 09:11:04 +02:00
Nicolas Patry
e48e07c04b
Update lock
2024-08-21 09:11:03 +02:00
Nicolas Patry
bd0ced354d
More specific codes.
2024-08-21 09:10:49 +02:00
Nicolas Patry
f5ee062cbd
Disable prefix caching for lora.
2024-08-21 09:10:49 +02:00
Nicolas Patry
719d7b4d54
Disabling flashinfer/prefix caching on odd head_dim
2024-08-21 09:10:47 +02:00
Nicolas Patry
7857910435
Allowing window_left_size (dummy version).
2024-08-21 09:10:19 +02:00
Nicolas Patry
73fd04d60a
Using prebuilt.
2024-08-21 09:06:54 +02:00
Nicolas Patry
5336755358
Include flashinfer in the docker.
2024-08-21 09:06:54 +02:00
Nicolas Patry
52c813527a
Making prefix/flashinfer the default and testing the full release tests.
2024-08-21 09:06:54 +02:00
Nicolas Patry
310778e02a
Adding eetq to flake. ( #2438 )
2024-08-21 09:06:33 +02:00
Daniël de Kok
9474415095
nix: add `text-generation-benchmark` to pure devshell ( #2431 )
...
nix: add text-generation-benchmark to pure devshell
2024-08-21 07:48:13 +02:00
Daniël de Kok
f5f11b797e
nix: add pure server to flake, add both pure and impure devshells ( #2430 )
...
* nix: pure server and support both pure and impure devShells
* nix: remove unused poetry2nix input
It is not wired up and we now have a pure server.
* nix: add ipdb to impure devshell
2024-08-20 22:07:33 +02:00
Nicolas Patry
b70ae0969f
Prefix caching ( #2402 )
...
* Prefix caching WIP
* Fixing prefix attention.
* Fixing flashinfer import.
* Fixing black.
* Fixing medusa (still wrong outputs, but functional).
* Just medusa values now.
* Fixing medusa without prefix caching.
* Fixing prefix caching.
* Medusa requires reshaping.
* Removing the logs.
* Remove router.nix
* Fixup:
- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.
* Update flake.lock
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-20 11:15:30 +02:00
Daniël de Kok
38773453ae
nix: update to CUDA 12.4 ( #2429 )
...
* Update to CUDA 12.4
* poetry2nix: follow tgi-nix nixpkgs
2024-08-19 09:28:38 +02:00
Nicolas Patry
e4201f44cf
All integration tests back everywhere (too many failed CI). ( #2428 )
...
* All integration tests back everywhere (too many failed CI).
* Upgrade integration tests after 12.4
* Attempt to remove the specified compute cap.
* Common arch list.
* Punica uses raw ASM which is not valid on 9.0 apparently.
2024-08-16 21:19:46 +02:00
Hugo Larcher
53729b74ac
doc: Add metrics documentation and add a 'Reference' section ( #2230 )
...
* doc: Add metrics documentation and add a 'Reference' section
* doc: Add API reference
* doc: Refactor API reference
* fix: Message API link
* Bad rebase
* Moving the docs.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-16 19:43:30 +02:00
Nicolas Patry
cb0a29484d
Fixing the CI.
2024-08-16 14:21:29 +02:00
Nicolas Patry
c7ab1810d4
Further fixes. ( #2426 )
...
* Further fixes.
* Update the conftest to allow NaN (first logprob).
* Fix the condition.
2024-08-16 13:21:44 +02:00
Vaibhav Srivastav
99b662f8c2
Improve the Consuming TGI + Streaming docs. ( #2412 )
...
* Improve the Consuming TGI docs.
* Fix erroneous update to .
* add info about Open AI client.
* More updates.
* Apply suggestions from code review
Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
* Suggestions from Lucain.
* Update Gradio snippet.
* Up.
* Apply suggestions from code review
Co-authored-by: Lucain <lucainp@gmail.com>
* Update docs/source/basic_tutorials/consuming_tgi.md
Co-authored-by: Lucain <lucainp@gmail.com>
* Up.
* Apply suggestions from code review
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Up.
* Up.
* Doc review from Nico.
* Doc review from Nico. x2
* Last nit
---------
Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
2024-08-16 12:43:08 +02:00
Daniël de Kok
1411bfb989
nix: try to reduce the number of Rust rebuilds ( #2424 )
...
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
2024-08-16 10:01:01 +02:00
Nicolas Patry
1b0aa06204
Upgrading the tests to match the current workings. ( #2423 )
2024-08-15 13:28:42 +02:00
Nicolas Patry
57b3495823
Fixing exl2 and other quantize tests again. ( #2419 )
...
* Fixing exl2 and other quantize tests again.
* Mark exl2 as non release (so CI tests them, needs to be removed later).
* Fixing exl2 (by disabling cuda graphs)
* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-08-15 11:12:51 +02:00
Daniël de Kok
9aaa12e7ac
nix: build router incrementally ( #2422 )
2024-08-15 10:21:51 +02:00
Funtowicz Morgan
3f385991b0
More fixes trtllm ( #2342 )
...
* (backend) use parking_lot crate for RwLock fairness
* (docker) let's put rust in the TRTLLM folder when building
* (docker) build ompi with SLURM support
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
2024-08-14 12:02:05 +02:00
Nicolas Patry
f3b5c69441
Upgrading exl2. ( #2415 )
...
* Upgrading exl2.
* Fixing the other pathways.
* Fix idefics.
2024-08-14 11:58:08 +02:00
Daniël de Kok
c5fff92b48
nix: partial incremental build of the router ( #2416 )
...
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
2024-08-14 11:06:28 +02:00
drbh
1cebccc72b
fix: adds causal to attention params ( #2408 )
...
fix: adds causal to attention params to check when using flash attn v1
2024-08-13 16:19:46 +02:00
Wang, Yi
59922f9bc1
add numa to improve cpu inference perf ( #2330 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-08-13 15:33:55 +02:00
Nicolas Patry
cd9b15d17f
Adding more kernels to flake. ( #2411 )
2024-08-13 10:49:18 +02:00
Daniël de Kok
6f4bb4f26f
nix: incremental build of the launcher ( #2410 )
2024-08-13 10:44:15 +02:00
drbh
8a7749b8fb
fix: include create_exllama_buffers and set_device for exllama ( #2407 )
2024-08-12 17:59:37 -04:00
drbh
9a7830bd28
Pr 2395 ci run ( #2406 )
...
* fix(router): Fix appending to message content
* feat: add message and chat template test
---------
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-08-12 14:38:59 -04:00
Nicolas Patry
19ea85f8dc
Updating the flake. ( #2404 )
2024-08-12 18:09:16 +02:00
drbh
30395b09f4
fix: improve completions to send a final chunk with usage details ( #2336 )
...
* fix: improve completions to send a final chunk with usage details
* fix: include finish reason string
* fix: remove dev debug trait and unneeded mut
* fix: update openapi schema
2024-08-12 17:26:11 +02:00
drbh
4c3f8a70a1
fix: allocate tmp based on sgmv kernel if available ( #2345 )
...
* fix: allocate tmp based on sgmv kernel if available
* fix: re add copy build artifacts step for punica kernels
2024-08-12 17:24:32 +02:00
drbh
155f9c98e2
feat: validate template variables before apply and improve sliding wi… ( #2403 )
...
* feat: validate template variables before apply and improve sliding window check
* fix: improve missing template var test
2024-08-12 10:58:40 -04:00
Nicolas Patry
136bcc8128
Keeping the benchmark somewhere ( #2401 )
...
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-12 15:22:02 +02:00
Daniël de Kok
8deeaca4ff
Add support for prefix caching to the v3 router ( #2392 )
...
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.
For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`.
In this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously-seen
prefill, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.
Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
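
The prefix lookup described above can be illustrated with a minimal sketch. This is not the router's `RadixAllocator` (which lives in the Rust router); it is a plain, uncompressed token-level trie standing in for the radix trie, and it does not track KV blocks or eviction. The names `TrieNode`, `PrefixTracker`, `insert`, and the `prefix_len` method are hypothetical, chosen only for illustration. It also assumes that a match means the longest leading run of tokens shared with any earlier prefill.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One token position in the trie; children are keyed by token id."""

    children: dict[int, TrieNode] = field(default_factory=dict)


class PrefixTracker:
    """Hypothetical helper: remembers prefills and reports shared prefix lengths."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        """Record a prefill so later requests can match against it."""
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, TrieNode())

    def prefix_len(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already seen in an earlier prefill."""
        node = self.root
        matched = 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            node = child
            matched += 1
        return matched


tracker = PrefixTracker()
tracker.insert([1, 2, 3, 4])

# A request that extends a known prefill can reuse all four cached positions.
extended = tracker.prefix_len([1, 2, 3, 4, 5])
# A request that diverges after two tokens can reuse only those two.
diverged = tracker.prefix_len([1, 2, 9])
# A request with no shared start gets prefix_len == 0 (nothing to reuse).
unrelated = tracker.prefix_len([7, 8])
```

A real radix trie would compress chains of single-child nodes into one edge, and an allocator would additionally map matched positions to KV block ids; both are left out here to keep the lookup itself visible.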
2024-08-12 14:59:17 +02:00
Wang, Yi
b6bb1d5160
Cpu dockerimage ( #2367 )
...
add intel-cpu docker image
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-08-12 14:10:30 +02:00
Nicolas Patry
84bc3d7b7d
Fixing import exl2 ( #2399 )
2024-08-12 14:08:59 +02:00
Nicolas Patry
730fa00e20
Adding launcher to build. ( #2397 )
2024-08-12 14:08:46 +02:00
Nicolas Patry
9c739651cd
Upgrade fbgemm ( #2398 )
...
* Upgrade fbgemm
* Fix fbgemm version
2024-08-12 14:08:38 +02:00
Daniël de Kok
01a515dea2
nix: add router to the devshell ( #2396 )
2024-08-12 09:28:38 +02:00
Daniël de Kok
8dcc7d3f6b
Update flake for 9.0a capability in Torch ( #2394 )
2024-08-09 22:36:51 +02:00
drbh
0d06aed02d
feat: add guideline to chat request and template ( #2391 )
...
* feat: add guideline to chat request and template
* fix: add template test and update docs
2024-08-09 10:56:45 -04:00
Nicolas Patry
7a48a84784
Using an enum for flash backends (paged/flashdecoding/flashinfer) ( #2385 )
...
* Using an enum for flash backends (paged/flashdecoding/flashinfer)
* Early exit on server too.
* Clippy.
* Fix clippy and fmt.
2024-08-09 16:41:17 +02:00
Daniël de Kok
6e127dcc96
flake: use rust-overlay ( #2390 )
2024-08-09 15:24:21 +02:00