1004 Commits

Author SHA1 Message Date
Nicolas Patry f1c0735453
Don't enable prefix caching on VLM just yet. 2024-08-27 20:06:11 +02:00
Nicolas Patry e30fb25444
Fixing the default for vlm. 2024-08-27 20:06:11 +02:00
Nicolas Patry 27b566baa8
Downgrade some logs. 2024-08-27 20:06:11 +02:00
Nicolas Patry 26e5037de4
This seems to be working. 2024-08-27 20:06:10 +02:00
Nicolas Patry f5182c188c
Is this enough to make it work ? 2024-08-27 20:06:10 +02:00
Nicolas Patry 1568e82548
Override the env in server tests. 2024-08-27 20:06:10 +02:00
Nicolas Patry 682db34b6a
Handling debugger. 2024-08-27 20:06:10 +02:00
Nicolas Patry c53968dc45
Remove lambda for cleaner function. 2024-08-27 20:06:10 +02:00
Nicolas Patry 32f6416358
Upgrade resolution system for fewer errors in resolution. 2024-08-27 20:06:10 +02:00
Nicolas Patry 5eb6ea0063
Tmp 2024-08-27 20:06:09 +02:00
Nicolas Patry 0bf4eb9683
Updated flake lock 2024-08-27 20:06:09 +02:00
Nicolas Patry b80593bfa3
Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-08-27 20:06:09 +02:00
Nicolas Patry 8d0220a695
Forgot last default place. 2024-08-27 20:06:09 +02:00
Nicolas Patry 860b550cdf
Everywhere 1.80 2024-08-27 20:06:09 +02:00
Nicolas Patry 344fee0d44
Upgrade to 1.80 because of bitstream... 2024-08-27 20:06:09 +02:00
Nicolas Patry 17c8a5e574
Update cargo lock ? 2024-08-27 20:06:06 +02:00
Nicolas Patry ba1ce20ce8
Updating integration tests with new values with FI/FD.
Remove paged as a default too, and using FD everywhere.
2024-08-27 20:05:29 +02:00
Nicolas Patry ffb6841121
Update lock 2024-08-27 20:05:29 +02:00
Nicolas Patry f0b35f94b8
More specific codes. 2024-08-27 20:05:29 +02:00
Nicolas Patry a6cd5fef23
Disable prefix caching for lora. 2024-08-27 20:05:29 +02:00
Nicolas Patry cba59aca03
Disabling flashinfer/prefix caching on odd head_dim 2024-08-27 20:05:29 +02:00
Nicolas Patry f55278de2d
Allowing window_left_size (dummy version). 2024-08-27 20:05:29 +02:00
Nicolas Patry f2bdc65098
Using prebuilt. 2024-08-27 20:05:28 +02:00
Nicolas Patry 9d4c5d39fe
Include flashinfer in the docker. 2024-08-27 20:05:28 +02:00
Nicolas Patry 60719babf6
Making prefix/flashinfer the default and testing the full release tests. 2024-08-27 20:05:28 +02:00
drbh 21187c27c9
fix: bump minijinja version and add test for llama 3.1 tools (#2463)
* fix: support tojson and avoid message indexing issue in template

* fix: prefer minijinja native methods and prefer workspace level dependency

* fix: adjust comment typo
2024-08-27 13:31:08 -04:00
Nicolas Patry 2788d41a76
Fixing CI. (#2462) 2024-08-27 15:33:02 +02:00
drbh cfa73b5c99
Pr 2451 ci branch (#2454)
* fix[router]: Fix tools not passed in chat template

Signed-off-by: GitHub <noreply@github.com>

* feat: improve default tool serialization and lints

* feat: refactor tool logic to include notify_error in prompt and adjust typing

* fix: adjust non tool template apply

* fix: simplify tool grammar logic and improve schema

* feat: avoid skip tool test and avoid empty tool prompts

* fix: increase test client timeout for grammar compilation tests

---------

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
2024-08-26 20:19:38 -04:00
drbh 30be188400
Fix: don't apply post layernorm in SiglipVisionTransformer (#2459)
* Fix: don't apply post layernorm in SiglipVisionTransformer

This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813).

This also makes Siglip consistent with the existing Clip implementation:

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

---------

Co-authored-by: Travis Addair <tgaddair@gmail.com>
2024-08-26 17:04:46 -04:00
Daniël de Kok f3c5d7d92f
nix: add default package (#2453)
The default package wraps the launcher and puts the server/router in the
path.

As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```
2024-08-23 22:06:22 +02:00
Daniël de Kok 358ceb67dd
nix: add awq-inference-engine as server dependency (#2442) 2024-08-21 22:20:03 +02:00
Nicolas Patry 310778e02a
Adding eetq to flake. (#2438) 2024-08-21 09:06:33 +02:00
Daniël de Kok 9474415095
nix: add `text-generation-benchmark` to pure devshell (#2431)
nix: add text-generation-benchmark to pure devshell
2024-08-21 07:48:13 +02:00
Daniël de Kok f5f11b797e
nix: add pure server to flake, add both pure and impure devshells (#2430)
* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell
2024-08-20 22:07:33 +02:00
Nicolas Patry b70ae0969f
Prefix caching (#2402)
* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-08-20 11:15:30 +02:00
Daniël de Kok 38773453ae
nix: update to CUDA 12.4 (#2429)
* Update to CUDA 12.4

* poetry2nix: follow tgi-nix nixpkgs
2024-08-19 09:28:38 +02:00
Nicolas Patry e4201f44cf
All integration tests back everywhere (too many failed CI). (#2428)
* All integration tests back everywhere (too many failed CI).

* Upgrade integration tests after 12.4

* Attempt to remove the specified compute cap.

* Common arch list.

* Punica uses raw ASM which is not valid on 9.0 apparently.
2024-08-16 21:19:46 +02:00
Hugo Larcher 53729b74ac
doc: Add metrics documentation and add a 'Reference' section (#2230)
* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-16 19:43:30 +02:00
Nicolas Patry cb0a29484d
Fixing the CI. 2024-08-16 14:21:29 +02:00
Nicolas Patry c7ab1810d4
Further fixes. (#2426)
* Further fixes.

* Update the conftest to allow NaN (first logprob).

* Fix the condition.
2024-08-16 13:21:44 +02:00
Vaibhav Srivastav 99b662f8c2
Improve the Consuming TGI + Streaming docs. (#2412)
* Improve the Consuming TGI docs.

* Fix erroneous update to .

* add info about OpenAI client.

* More updates.

* Apply suggestions from code review

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>

* Suggestions from Lucain.

* Update Gradio snippet.

* Up.

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* Update docs/source/basic_tutorials/consuming_tgi.md

Co-authored-by: Lucain <lucainp@gmail.com>

* Up.

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Up.

* Up.

* Doc review from Nico.

* Doc review from Nico. x2

* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
2024-08-16 12:43:08 +02:00
Daniël de Kok 1411bfb989
nix: try to reduce the number of Rust rebuilds (#2424)
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
2024-08-16 10:01:01 +02:00
Nicolas Patry 1b0aa06204
Upgrading the tests to match the current workings. (#2423) 2024-08-15 13:28:42 +02:00
Nicolas Patry 57b3495823
Fixing exl2 and other quantize tests again. (#2419)
* Fixing exl2 and other quantize tests again.

* Mark exl2 as non release (so CI tests them, needs to be removed later).

* Fixing exl2 (by disabling cuda graphs)

* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).

* Removing serde override.

* Go back to released exl2 and remove log.

* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
2024-08-15 11:12:51 +02:00
Daniël de Kok 9aaa12e7ac
nix: build router incrementally (#2422) 2024-08-15 10:21:51 +02:00
Funtowicz Morgan 3f385991b0
More fixes trtllm (#2342)
* (backend) use parking_lot crate for RwLock fairness

* (docker) let's put rust in the TRTLLM folder when building

* (docker) build ompi with SLURM support

* (launcher) default new server::run parameters to false for now

* (chore) fmt ... why?
2024-08-14 12:02:05 +02:00
Nicolas Patry f3b5c69441
Upgrading exl2. (#2415)
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
2024-08-14 11:58:08 +02:00
Daniël de Kok c5fff92b48
nix: partial incremental build of the router (#2416)
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
2024-08-14 11:06:28 +02:00
drbh 1cebccc72b
fix: adds causal to attention params (#2408)
fix: adds causal to attention params to check when using flash attn v1
2024-08-13 16:19:46 +02:00
Wang, Yi 59922f9bc1
add numa to improve cpu inference perf (#2330)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-08-13 15:33:55 +02:00