hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Daniël de Kok	5b6b74e21d	Improve support for GPUs with capability < 8 (#2575 ) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention on GPUs with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-09-27 16:19:42 +02:00
Nicolas Patry	f512021e77	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-19 20:50:37 +02:00
Daniël de Kok	ce85efa968	Move to moe-kernels package and switch to common MoE layer (#2511 ) * Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner	2024-09-17 18:08:58 +02:00
Daniël de Kok	94304649f1	nix: support Python tokenizer conversion in the router (#2515 ) Ideally we wouldn't have the router wrapper that this change adds, but when I give PyO3 a Python interpreter with packages, it ends up linking libpython from the Python interpreter rather than the constructed environment and cannot pick up the Python modules as a result.	2024-09-12 10:44:01 +02:00
Daniël de Kok	de2cdeca53	nix: add punica-kernels (#2477 ) Enables LoRA support.	2024-09-02 11:31:36 +02:00
Daniël de Kok	4e821c003a	nix: build Torch against MKL and various other improvements (#2469 ) Updates tgi-nix input: - Move Torch closer to upstream by building against MKL. - Remove compute capability 8.7 from Torch (Jetson). - Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA). - Use nixpkgs configuration passed through by `tgi-nix`.	2024-08-29 16:25:25 +02:00
Daniël de Kok	358ceb67dd	nix: add awq-inference-engine as server dependency (#2442 )	2024-08-21 22:20:03 +02:00
Nicolas Patry	310778e02a	Adding eetq to flake. (#2438 )	2024-08-21 09:06:33 +02:00
Daniël de Kok	9474415095	nix: add `text-generation-benchmark` to pure devshell (#2431 )	2024-08-21 07:48:13 +02:00
Daniël de Kok	f5f11b797e	nix: add pure server to flake, add both pure and impure devshells (#2430 ) * nix: pure server and support both pure and impure devShells * nix: remove unused poetry2nix input It is not wired up and we now have a pure server. * nix: add ipdb to impure devshell	2024-08-20 22:07:33 +02:00
Daniël de Kok	1411bfb989	nix: try to reduce the number of Rust rebuilds (#2424 ) Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.	2024-08-16 10:01:01 +02:00

11 Commits