hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Nicolas Patry	2652e209e7	Updated flake lock	2024-08-21 09:15:10 +02:00
Nicolas Patry	3ece76392b	Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com>	2024-08-21 09:11:05 +02:00
Nicolas Patry	cdbf73eef8	Forgot last default place.	2024-08-21 09:11:05 +02:00
Nicolas Patry	3d46783f1a	Everywhere 1.80	2024-08-21 09:11:05 +02:00
Nicolas Patry	e2319fa891	Upgrade to 1.80 because of bitstream...	2024-08-21 09:11:05 +02:00
Nicolas Patry	f628886c0a	Update cargo lock ?	2024-08-21 09:11:05 +02:00
Nicolas Patry	2fe5879816	Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.	2024-08-21 09:11:04 +02:00
Nicolas Patry	e48e07c04b	Update lock	2024-08-21 09:11:03 +02:00
Nicolas Patry	bd0ced354d	More specific codes.	2024-08-21 09:10:49 +02:00
Nicolas Patry	f5ee062cbd	Disable prefix caching for lora.	2024-08-21 09:10:49 +02:00
Nicolas Patry	719d7b4d54	Disabling flashinfer/prefix caching on odd head_dim	2024-08-21 09:10:47 +02:00
Nicolas Patry	7857910435	Allowing window_left_size (dummy version).	2024-08-21 09:10:19 +02:00
Nicolas Patry	73fd04d60a	Using prebuilt.	2024-08-21 09:06:54 +02:00
Nicolas Patry	5336755358	Include flashinfer in the docker.	2024-08-21 09:06:54 +02:00
Nicolas Patry	52c813527a	Making prefix/flashinfer the default and testing the full release tests.	2024-08-21 09:06:54 +02:00
Nicolas Patry	310778e02a	Adding eetq to flake. (#2438 )	2024-08-21 09:06:33 +02:00
Daniël de Kok	9474415095	nix: add `text-generation-benchmark` to pure devshell (#2431 ) nix: add text-generation-benchmark to pure devshell	2024-08-21 07:48:13 +02:00
Daniël de Kok	f5f11b797e	nix: add pure server to flake, add both pure and impure devshells (#2430 ) * nix: pure server and support both pure and impure devShells * nix: remove unused poetry2nix input It is not wired up and we now have a pure server. * nix: add ipdb to impure devshell	2024-08-20 22:07:33 +02:00
Nicolas Patry	b70ae0969f	Prefix caching (#2402 ) * Prefix caching WIP * Fixing prefix attention. * Fixing flashinfer import. * Fixing black. * Fixing medusa (still wrong outputs, but functional). * Just medusa values now. * Fixing medusa without prefix caching. * Fixing prefix caching. * Medusa requires reshaping. * Removing the logs. * Remove router.nix * Fixup: - Remove logs - Disable VLMs (they do not work) - Disable prefix caching when user wants prefill logprobs. * Update flake.lock --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-08-20 11:15:30 +02:00
Daniël de Kok	38773453ae	nix: update to CUDA 12.4 (#2429 ) * Update to CUDA 12.4 * poetry2nix: follow tgi-nix nixpkgs	2024-08-19 09:28:38 +02:00
Nicolas Patry	e4201f44cf	All integration tests back everywhere (too many failed CI). (#2428 ) * All integration tests back everywhere (too many failed CI). * Upgrade integration tests after 12.4 * Attempt to remove the specified compute cap. * Common arch list. * Punica uses raw ASM which is not valid on 9.0 apparently.	2024-08-16 21:19:46 +02:00
Hugo Larcher	53729b74ac	doc: Add metrics documentation and add a 'Reference' section (#2230 ) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-16 19:43:30 +02:00
Nicolas Patry	cb0a29484d	Fixing the CI.	2024-08-16 14:21:29 +02:00
Nicolas Patry	c7ab1810d4	Further fixes. (#2426 ) * Further fixes. * Update the conftest to allow NaN (first logprob). * Fix the condition.	2024-08-16 13:21:44 +02:00
Vaibhav Srivastav	99b662f8c2	Improve the Consuming TGI + Streaming docs. (#2412 ) * Improve the Consuming TGI docs. * Fix erroneous update to . * add info about Open AI client. * More updates. * Apply suggestions from code review Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> * Suggestions from Lucain. * Update Gradio snippet. * Up. * Apply suggestions from code review Co-authored-by: Lucain <lucainp@gmail.com> * Update docs/source/basic_tutorials/consuming_tgi.md Co-authored-by: Lucain <lucainp@gmail.com> * Up. * Apply suggestions from code review Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Up. * Up. * Doc review from Nico. * Doc review from Nico. x2 * Last nit --------- Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>	2024-08-16 12:43:08 +02:00
Daniël de Kok	1411bfb989	nix: try to reduce the number of Rust rebuilds (#2424 ) Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.	2024-08-16 10:01:01 +02:00
Nicolas Patry	1b0aa06204	Upgrading the tests to match the current workings. (#2423 )	2024-08-15 13:28:42 +02:00
Nicolas Patry	57b3495823	Fixing exl2 and other quantize tests again. (#2419 ) * Fixing exl2 and other quantize tests again. * Mark exl2 as non release (so CI tests them, needs to be removed later). * Fixing exl2 (by disabling cuda graphs) * Fix quantization defaults without cuda graphs on exl2 (linked to new issues with it). * Removing serde override. * Go back to released exl2 and remove log. * Adding warnings for deprecated bitsandbytes + upgrade info to warn.	2024-08-15 11:12:51 +02:00
Daniël de Kok	9aaa12e7ac	nix: build router incrementally (#2422 )	2024-08-15 10:21:51 +02:00
Funtowicz Morgan	3f385991b0	More fixes trtllm (#2342 ) * (backend) use parking_lot crate for RwLock fairness * (docker) let's put rust in the TRTLLM folder when building * (docker) build ompi with SLURM support * (launcher) default new server::run parameters to false for now * (chore) fmt ... why?	2024-08-14 12:02:05 +02:00
Nicolas Patry	f3b5c69441	Upgrading exl2. (#2415 ) * Upgrading exl2. * Fixing the other pathways. * Fix idefics.	2024-08-14 11:58:08 +02:00
Daniël de Kok	c5fff92b48	nix: partial incremental build of the router (#2416 ) This is less incremental than crate2nix, but does build all dependencies separately, so avoids full rebuilds.	2024-08-14 11:06:28 +02:00
drbh	1cebccc72b	fix: adds causal to attention params (#2408 ) fix: adds causal to attention params to check when using flash attn v1	2024-08-13 16:19:46 +02:00
Wang, Yi	59922f9bc1	add numa to improve cpu inference perf (#2330 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-13 15:33:55 +02:00
Nicolas Patry	cd9b15d17f	Adding more kernels to flake. (#2411 )	2024-08-13 10:49:18 +02:00
Daniël de Kok	6f4bb4f26f	nix: incremental build of the launcher (#2410 )	2024-08-13 10:44:15 +02:00
drbh	8a7749b8fb	fix: include create_exllama_buffers and set_device for exllama (#2407 )	2024-08-12 17:59:37 -04:00
drbh	9a7830bd28	Pr 2395 ci run (#2406 ) * fix(router): Fix appending to message content * feat: add message and chat template test --------- Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-08-12 14:38:59 -04:00
Nicolas Patry	19ea85f8dc	Updating the flake. (#2404 )	2024-08-12 18:09:16 +02:00
drbh	30395b09f4	fix: improve completions to send a final chunk with usage details (#2336 ) * fix: improve completions to send a final chunk with usage details * fix: include finish reason string * fix: remove dev debug trait and unneeded mut * fix: update openapi schema	2024-08-12 17:26:11 +02:00
drbh	4c3f8a70a1	fix: allocate tmp based on sgmv kernel if available (#2345 ) * fix: allocate tmp based on sgmv kernel if available * fix: re add copy build artifacts step for punica kernels	2024-08-12 17:24:32 +02:00
drbh	155f9c98e2	feat: validate template variables before apply and improve sliding wi… (#2403 ) * feat: validate template variables before apply and improve sliding window check * fix: improve missing template var test	2024-08-12 10:58:40 -04:00
Nicolas Patry	136bcc8128	Keeping the benchmark somewhere (#2401 ) Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-08-12 15:22:02 +02:00
Daniël de Kok	8deeaca4ff	Add support for prefix caching to the v3 router (#2392 ) This change adds support for prefix caching to the v3 router. This is broken up from the backend support to ease reviewing. For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`; in this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen prior. If a new prefill is a prefix of a previously-seen prefill, the router will send a request with `prefix_len>0`, which can be used by the backend to decide to reuse KV blocks from the cache, rather than recomputing them. Even though backend support is not added in this PR, the backend will still work with prefix caching enabled. The prefix lengths are just ignored and not used.	2024-08-12 14:59:17 +02:00
Wang, Yi	b6bb1d5160	Cpu dockerimage (#2367 ) add intel-cpu docker image Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-12 14:10:30 +02:00
Nicolas Patry	84bc3d7b7d	Fixing import exl2 (#2399 )	2024-08-12 14:08:59 +02:00
Nicolas Patry	730fa00e20	Adding launcher to build. (#2397 )	2024-08-12 14:08:46 +02:00
Nicolas Patry	9c739651cd	Upgrade fbgemm (#2398 ) * Upgrade fbgemm * Fix fbgemm version	2024-08-12 14:08:38 +02:00
Daniël de Kok	01a515dea2	nix: add router to the devshell (#2396 )	2024-08-12 09:28:38 +02:00
Daniël de Kok	8dcc7d3f6b	Update flake for 9.0a capability in Torch (#2394 )	2024-08-09 22:36:51 +02:00
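
Note on commit 8deeaca4ff: the message describes a router-side allocator that tracks previously seen prefills in a radix trie and reports a `prefix_len` for new requests. The sketch below illustrates only that lookup idea. It is not TGI's `RadixAllocator`: it uses a plain, uncompressed token-level trie, omits KV block allocation and eviction, and the names `PrefixTrie`, `insert`, and `longest_prefix_len` are hypothetical.

```python
from typing import Dict, List


class _Node:
    """One trie node; children are keyed by token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class PrefixTrie:
    """Hypothetical token-level trie for illustration only.

    A real radix trie would compress single-child chains into one edge
    and would also track which cache blocks each node owns.
    """

    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, tokens: List[int]) -> None:
        """Record a prefill (a list of token ids) as seen."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def longest_prefix_len(self, tokens: List[int]) -> int:
        """Count how many leading tokens match a previously seen prefill."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


trie = PrefixTrie()
trie.insert([1, 2, 3, 4])                               # first request's prefill
prefix_len = trie.longest_prefix_len([1, 2, 3, 9, 9])   # shares the first 3 tokens
unseen_len = trie.longest_prefix_len([7, 8])            # shares nothing
```

In this sketch a non-zero `prefix_len` plays the role the commit message assigns to `prefix_len>0`: a hint that the backend may be able to reuse cached work for those leading tokens instead of recomputing it.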



988 Commits
