Nicolas Patry
f1c0735453
Don't enable prefix caching on VLM just yet.
2024-08-27 20:06:11 +02:00
Nicolas Patry
e30fb25444
Fixing the default for vlm.
2024-08-27 20:06:11 +02:00
Nicolas Patry
27b566baa8
Downgrade some logs.
2024-08-27 20:06:11 +02:00
Nicolas Patry
26e5037de4
This seems to be working.
2024-08-27 20:06:10 +02:00
Nicolas Patry
f5182c188c
Is this enough to make it work ?
2024-08-27 20:06:10 +02:00
Nicolas Patry
1568e82548
Override the env in server tests.
2024-08-27 20:06:10 +02:00
Nicolas Patry
682db34b6a
Handling debugger.
2024-08-27 20:06:10 +02:00
Nicolas Patry
c53968dc45
Remove lambda for cleaner function.
2024-08-27 20:06:10 +02:00
Nicolas Patry
32f6416358
Upgrade resolution system for fewer errors in resolution.
2024-08-27 20:06:10 +02:00
Nicolas Patry
5eb6ea0063
Tmp
2024-08-27 20:06:09 +02:00
Nicolas Patry
0bf4eb9683
Updated flake lock
2024-08-27 20:06:09 +02:00
Nicolas Patry
b80593bfa3
Apply suggestions from code review
...
Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-08-27 20:06:09 +02:00
Nicolas Patry
8d0220a695
Forgot last default place.
2024-08-27 20:06:09 +02:00
Nicolas Patry
860b550cdf
Everywhere 1.80
2024-08-27 20:06:09 +02:00
Nicolas Patry
344fee0d44
Upgrade to 1.80 because of bitstream...
2024-08-27 20:06:09 +02:00
Nicolas Patry
17c8a5e574
Update cargo lock ?
2024-08-27 20:06:06 +02:00
Nicolas Patry
ba1ce20ce8
Updating integration tests with new values with FI/FD.
...
Remove paged as a default too, and using FD everywhere.
2024-08-27 20:05:29 +02:00
Nicolas Patry
ffb6841121
Update lock
2024-08-27 20:05:29 +02:00
Nicolas Patry
f0b35f94b8
More specific codes.
2024-08-27 20:05:29 +02:00
Nicolas Patry
a6cd5fef23
Disable prefix caching for lora.
2024-08-27 20:05:29 +02:00
Nicolas Patry
cba59aca03
Disabling flashinfer/prefix caching on odd head_dim
2024-08-27 20:05:29 +02:00
Nicolas Patry
f55278de2d
Allowing window_left_size (dummy version).
2024-08-27 20:05:29 +02:00
Nicolas Patry
f2bdc65098
Using prebuilt.
2024-08-27 20:05:28 +02:00
Nicolas Patry
9d4c5d39fe
Include flashinfer in the docker.
2024-08-27 20:05:28 +02:00
Nicolas Patry
60719babf6
Making prefix/flashinfer the default and testing the full release tests.
2024-08-27 20:05:28 +02:00
drbh
21187c27c9
fix: bump minijinja version and add test for llama 3.1 tools ( #2463 )
...
* fix: support tojson and avoid message indexing issue in template
* fix: prefer minijinja native methods and prefer workspace level dependency
* fix: adjust comment typo
2024-08-27 13:31:08 -04:00
Nicolas Patry
2788d41a76
Fixing CI. ( #2462 )
2024-08-27 15:33:02 +02:00
drbh
cfa73b5c99
Pr 2451 ci branch ( #2454 )
...
* fix[router]: Fix tools not passed in chat template
Signed-off-by: GitHub <noreply@github.com>
* feat: improve default tool serialization and lints
* feat: refactor tool logic to include notify_error in prompt and adjust typing
* fix: adjust non tool template apply
* fix: simplify tool grammar logic and improve schema
* feat: avoid skip tool test and avoid empty tool prompts
* fix: increase test client timeout for grammar compilation tests
---------
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-08-26 20:19:38 -04:00
drbh
30be188400
Fix: don't apply post layernorm in SiglipVisionTransformer ( #2459 )
...
* Fix: don't apply post layernorm in SiglipVisionTransformer
This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813 ).
This also makes Siglip consistent with the existing Clip implementation:
https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613
* fix: adjust pali gemma for post layer norm and small refactors
---------
Co-authored-by: Travis Addair <tgaddair@gmail.com>
2024-08-26 17:04:46 -04:00
Daniël de Kok
f3c5d7d92f
nix: add default package ( #2453 )
...
The default package wraps the launcher and puts the server/router in the
path.
As a result, TGI can be started using something like:
```
nix run .# -- \
--model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--port 8080
```
2024-08-23 22:06:22 +02:00
Daniël de Kok
358ceb67dd
nix: add awq-inference-engine as server dependency ( #2442 )
2024-08-21 22:20:03 +02:00
Nicolas Patry
310778e02a
Adding eetq to flake. ( #2438 )
2024-08-21 09:06:33 +02:00
Daniël de Kok
9474415095
nix: add `text-generation-benchmark` to pure devshell ( #2431 )
...
nix: add text-generation-benchmark to pure devshell
2024-08-21 07:48:13 +02:00
Daniël de Kok
f5f11b797e
nix: add pure server to flake, add both pure and impure devshells ( #2430 )
...
* nix: pure server and support both pure and impure devShells
* nix: remove unused poetry2nix input
It is not wired up and we now have a pure server.
* nix: add ipdb to impure devshell
2024-08-20 22:07:33 +02:00
Nicolas Patry
b70ae0969f
Prefix caching ( #2402 )
...
* Prefix caching WIP
* Fixing prefix attention.
* Fixing flashinfer import.
* Fixing black.
* Fixing medusa (still wrong outputs, but functional).
* Just medusa values now.
* Fixing medusa without prefix caching.
* Fixing prefix caching.
* Medusa requires reshaping.
* Removing the logs.
* Remove router.nix
* Fixup:
- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.
* Update flake.lock
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-20 11:15:30 +02:00
Daniël de Kok
38773453ae
nix: update to CUDA 12.4 ( #2429 )
...
* Update to CUDA 12.4
* poetry2nix: follow tgi-nix nixpkgs
2024-08-19 09:28:38 +02:00
Nicolas Patry
e4201f44cf
All integration tests back everywhere (too many failed CI). ( #2428 )
...
* All integration tests back everywhere (too many failed CI).
* Upgrade integration tests after 12.4
* Attempt to remove the specified compute cap.
* Common arch list.
* Punica uses raw ASM which is not valid on 9.0 apparently.
2024-08-16 21:19:46 +02:00
Hugo Larcher
53729b74ac
doc: Add metrics documentation and add a 'Reference' section ( #2230 )
...
* doc: Add metrics documentation and add a 'Reference' section
* doc: Add API reference
* doc: Refactor API reference
* fix: Message API link
* Bad rebase
* Moving the docs.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-16 19:43:30 +02:00
Nicolas Patry
cb0a29484d
Fixing the CI.
2024-08-16 14:21:29 +02:00
Nicolas Patry
c7ab1810d4
Further fixes. ( #2426 )
...
* Further fixes.
* Update the conftest to allow NaN (first logprob).
* Fix the condition.
2024-08-16 13:21:44 +02:00
Vaibhav Srivastav
99b662f8c2
Improve the Consuming TGI + Streaming docs. ( #2412 )
...
* Improve the Consuming TGI docs.
* Fix erroneous update to .
* add info about Open AI client.
* More updates.
* Apply suggestions from code review
Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
* Suggestions from Lucain.
* Update Gradio snippet.
* Up.
* Apply suggestions from code review
Co-authored-by: Lucain <lucainp@gmail.com>
* Update docs/source/basic_tutorials/consuming_tgi.md
Co-authored-by: Lucain <lucainp@gmail.com>
* Up.
* Apply suggestions from code review
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Up.
* Up.
* Doc review from Nico.
* Doc review from Nico. x2
* Last nit
---------
Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
2024-08-16 12:43:08 +02:00
Daniël de Kok
1411bfb989
nix: try to reduce the number of Rust rebuilds ( #2424 )
...
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
2024-08-16 10:01:01 +02:00
Nicolas Patry
1b0aa06204
Upgrading the tests to match the current workings. ( #2423 )
2024-08-15 13:28:42 +02:00
Nicolas Patry
57b3495823
Fixing exl2 and other quantize tests again. ( #2419 )
...
* Fixing exl2 and other quantize tests again.
* Mark exl2 as non release (so CI tests them, needs to be removed later).
* Fixing exl2 (by disabling cuda graphs)
* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-08-15 11:12:51 +02:00
Daniël de Kok
9aaa12e7ac
nix: build router incrementally ( #2422 )
2024-08-15 10:21:51 +02:00
Funtowicz Morgan
3f385991b0
More fixes trtllm ( #2342 )
...
* (backend) use parking_lot crate for RwLock fairness
* (docker) let's put rust in the TRTLLM folder when building
* (docker) build ompi with SLURM support
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
2024-08-14 12:02:05 +02:00
Nicolas Patry
f3b5c69441
Upgrading exl2. ( #2415 )
...
* Upgrading exl2.
* Fixing the other pathways.
* Fix idefics.
2024-08-14 11:58:08 +02:00
Daniël de Kok
c5fff92b48
nix: partial incremental build of the router ( #2416 )
...
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
2024-08-14 11:06:28 +02:00
drbh
1cebccc72b
fix: adds causal to attention params ( #2408 )
...
fix: adds causal to attention params to check when using flash attn v1
2024-08-13 16:19:46 +02:00
Wang, Yi
59922f9bc1
add numa to improve cpu inference perf ( #2330 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-08-13 15:33:55 +02:00