hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Daniël de Kok	2007a9473a	Update to moe-kernels 0.7.0 (#2720 ) This version syncs with the vLLM kernels and brings some performance improvements.	2024-11-19 14:55:29 +01:00
Daniël de Kok	3c9df21ff8	Add support for compressed-tensors w8a8 int checkpoints (#2745 ) * Add support for compressed-tensors w8a8 int checkpoints This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5. Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8: \| Tasks \|Version\| Filter \|n-shot\| Metric \| \|Value \| \|Stderr\| \|---------------\|------:\|----------------\|-----:\|-----------------------\|---\|-----:\|---\|------\| \|gsm8k_cot_llama\| 3\|flexible-extract\| 8\|exact_match \|↑ \|0.8431\|± \|0.0100\| \| \| \|strict-match \| 8\|exact_match \|↑ \|0.8393\|± \|0.0101\| \|ifeval \| 4\|none \| 0\|inst_level_loose_acc \|↑ \|0.8597\|± \| N/A\| \| \| \|none \| 0\|inst_level_strict_acc \|↑ \|0.8201\|± \| N/A\| \| \| \|none \| 0\|prompt_level_loose_acc \|↑ \|0.7967\|± \|0.0173\| \| \| \|none \| 0\|prompt_level_strict_acc\|↑ \|0.7468\|± \|0.0187\| Which is the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels. * Always use dynamic input quantization for w8a8 int It's far less flaky and gives better output. * Use marlin-kernels 0.3.5 * Fix a typo Co-authored-by: drbh <david.richard.holtz@gmail.com> * Small fixes --------- Co-authored-by: drbh <david.richard.holtz@gmail.com>	2024-11-18 17:20:31 +01:00
Daniël de Kok	52e48739a5	Remove vLLM dependency for CUDA (#2751 ) * Remove vLLM dependency for CUDA This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA. Tested run (since we don't have paged attention in CI): ``` ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release [...] 5 snapshots passed. ``` * Fix clippy warning	2024-11-17 17:34:50 +01:00
Daniël de Kok	ca4f46ddfc	nix: update nixpkgs (#2746 ) Updates from Triton 2.1.0 to 3.1.0 (among other things).	2024-11-14 18:48:20 +01:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732 ) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
Daniël de Kok	5eedb2ec7a	nix: move to tgi-nix `main` (#2718 )	2024-11-04 15:40:13 +01:00
Daniël de Kok	0f346a3296	Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688 ) * Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing. * Update test snapshots	2024-10-25 16:40:47 +02:00
Daniël de Kok	eab07f746c	Add support for FP8 KV cache scales (#2628 ) * Add support for FP8 KV cache scales Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration data and stored in the checkpoint. This change adds support for using key-value scales and loading them from checkpoints in the two most common formats: - Separate per-layer `k_scale` and `v_scale` scalars. - Per-layer `kv_scale` scalar (older format). Currently, scales are only used with a `float8_e4m3fn` cache. Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy. * Update FP8 KV cache test to use checkpoint with scales * `can_scale`: check that the attention is flashinfer	2024-10-24 16:36:18 +02:00
Daniël de Kok	9c9ef37c56	Add `impureWithCuda` dev shell (#2677 ) * Add `impureWithCuda` dev shell This shell is handy when developing some kernels jointly with TGI - it adds nvcc and a bunch of commonly-used CUDA libraries to the environment. We don't add this to the normal impure shell to keep the development environment as clean as possible (avoid accidental dependencies, etc.). * Add cuDNN	2024-10-22 11:02:55 +02:00
Daniël de Kok	6db3bcb700	nix: move back to the tgi-nix main branch (#2620 )	2024-10-08 12:55:05 +02:00
Daniël de Kok	64142489b6	Add support for fused MoE Marlin for AWQ (#2616 ) * Add support for fused MoE Marlin for AWQ This uses the updated MoE Marlin kernels from vLLM. * Add integration test for AWQ MoE	2024-10-08 11:56:41 +02:00
Daniël de Kok	68103079f4	nix: example of local package overrides during development (#2607 )	2024-10-04 16:52:42 +02:00
Nicolas Patry	d18ed5cfc5	Mllama flash version (#2585 ) * Working loading state. * Preprocessing. * Working state ? (Broke idefics1 temporarily). * Cleaner condition. * Fix idefics. * Updating config, removing TODO * Mllama * Upgrade transformers 4.45 * Flashing mllama. * Starting to get there. * Working state. * Integration tests for mllama (cutting to 10 tokens because there seems to be instability after, meaning size of the batch matters). * Updating model link. * Earlier assert. * Fix vlm ? * remove log. * Force ignore all images but last. * Default dtype bfloat16. * Update integration test after switch to bf16. * Remove dead code. * Removed dead code. * Upgrade the flake to latest transformers/tokenizers * Move to hf tgi-nix * Upgrade to 0.5.0	2024-10-02 11:22:13 +02:00
Daniël de Kok	584b4d7a68	nix: experimental support for building a Docker container (#2470 ) * nix: experimental support for building a Docker image Run using something like: ``` docker run \ --device nvidia.com/gpu=all \ -it --rm -p 8080:80 \ -v $PWD/data:/data \ -v $PWD/tmp:/tmp \ tgi-docker:latest \ --model-id <model_id> ``` * Example of building the Docker image using Nix inside Docker * Stream to make the builder image smaller This avoids storing a Docker image tarball in the image. Instead, stream the layers while doing `docker run`. * Don't spam journalctl on Linux * Other dockerfile. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-01 18:02:06 +02:00
Daniël de Kok	1c84a30fe6	MoE Marlin: support `desc_act` for `groupsize != -1` (#2590 ) This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.	2024-09-30 19:40:25 +02:00
Daniël de Kok	d1f257ac56	Move flake back to tgi-nix `main` (#2586 )	2024-09-30 11:39:41 +02:00
Daniël de Kok	90a1d04a2f	Add support for GPTQ-quantized MoE models using MoE Marlin (#2557 ) This change adds support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported: - No `desc_act` with tensor parallelism, unless `group_size=-1`. - No asymmetric quantization. - No AWQ.	2024-09-30 11:14:32 +02:00
Daniël de Kok	5b6b74e21d	Improve support for GPUs with capability < 8 (#2575 ) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention on GPUs with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-09-27 16:19:42 +02:00
Daniël de Kok	abd24dd385	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536 )	2024-09-19 22:17:15 +02:00
Nicolas Patry	f512021e77	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-19 20:50:37 +02:00
Daniël de Kok	71e4268600	nix: pure Rust check/fmt/clippy/test (#2525 ) Runs the tests in a Nix build sandbox.	2024-09-17 12:14:30 +02:00
Nicolas Patry	d95c670ada	Add nix test. (#2513 ) * Add nix test. * Modifying yourself means you need to rerun. * Fixing the test + adding click (needed for pre-commit hooks). * Try this. * Our runner + pure test (not written) * Remove server. * Root user. * Different user ? * Add the actual test target. * Forgot this modification. * Add a formatter. * Add the secrets. * Fixed the auth token ? * Adding the other tests. * Missing pre-commit. * Test requires cargo for cargo fmt. * Update it a bit. * Up. * Attempting to use a cache location for the models. * Ignore the cache for now.	2024-09-12 14:54:56 +02:00
Daniël de Kok	94304649f1	nix: support Python tokenizer conversion in the router (#2515 ) Ideally we wouldn't have the router wrapper that this change adds, but when I give PyO3 a Python interpreter with packages, it ends up linking libpython from the Python interpreter rather than the constructed environment and cannot pick up the Python modules as a result.	2024-09-12 10:44:01 +02:00
Daniël de Kok	0424e27f65	nix: add pyright/ruff for proper LSP in the impure devshell (#2496 ) We need this to ensure that pyright/ruff are part of the same interpreter/venv.	2024-09-06 10:19:04 +02:00
Daniël de Kok	e4ab855480	nix: improve impure devshell (#2478 ) - Add some test dependencies. - Install server in venv. - Install Python client in venv.	2024-09-02 09:27:10 +02:00
Daniël de Kok	4e821c003a	nix: build Torch against MKL and various other improvements (#2469 ) Updates tgi-nix input: - Move Torch closer to upstream by building against MKL. - Remove compute capability 8.7 from Torch (Jetson). - Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA). - Use nixpkgs configuration passed through by `tgi-nix`.	2024-08-29 16:25:25 +02:00
Daniël de Kok	f3c5d7d92f	nix: add default package (#2453 ) The default package wraps the launcher and puts the server/router in the path. As a result, TGI can be started using something like: ``` nix run .# -- \ --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \ --port 8080 ```	2024-08-23 22:06:22 +02:00
Daniël de Kok	9474415095	nix: add `text-generation-benchmark` to pure devshell (#2431 ) nix: add text-generation-benchmark to pure devshell	2024-08-21 07:48:13 +02:00
Daniël de Kok	f5f11b797e	nix: add pure server to flake, add both pure and impure devshells (#2430 ) * nix: pure server and support both pure and impure devShells * nix: remove unused poetry2nix input It is not wired up and we now have a pure server. * nix: add ipdb to impure devshell	2024-08-20 22:07:33 +02:00
Nicolas Patry	b70ae0969f	Prefix caching (#2402 ) * Prefix caching WIP * Fixing prefix attention. * Fixing flashinfer import. * Fixing black. * Fixing medusa (still wrong outputs, but functional). * Just medusa values now. * Fixing medusa without prefix caching. * Fixing prefix caching. * Medusa requires reshaping. * Removing the logs. * Remove router.nix * Fixup: - Remove logs - Disable VLMs (they do not work) - Disable prefix caching when user wants prefill logprobs. * Update flake.lock --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-08-20 11:15:30 +02:00
Daniël de Kok	38773453ae	nix: update to CUDA 12.4 (#2429 ) * Update to CUDA 12.4 * poetry2nix: follow tgi-nix nixpkgs	2024-08-19 09:28:38 +02:00
Daniël de Kok	1411bfb989	nix: try to reduce the number of Rust rebuilds (#2424 ) Try to reduce the number of router/launcher rebuilds by filtering sources. In this way, recompiles should only be triggered by changes in Cargo or Rust files.	2024-08-16 10:01:01 +02:00
Daniël de Kok	9aaa12e7ac	nix: build router incrementally (#2422 )	2024-08-15 10:21:51 +02:00
Nicolas Patry	f3b5c69441	Upgrading exl2. (#2415 ) * Upgrading exl2. * Fixing the other pathways. * Fix idefics.	2024-08-14 11:58:08 +02:00
Daniël de Kok	c5fff92b48	nix: partial incremental build of the router (#2416 ) This is less incremental than crate2nix, but does build all dependencies separately, so avoids full rebuilds.	2024-08-14 11:06:28 +02:00
Nicolas Patry	cd9b15d17f	Adding more kernels to flake. (#2411 )	2024-08-13 10:49:18 +02:00
Daniël de Kok	6f4bb4f26f	nix: incremental build of the launcher (#2410 )	2024-08-13 10:44:15 +02:00
Nicolas Patry	19ea85f8dc	Updating the flake. (#2404 )	2024-08-12 18:09:16 +02:00
Nicolas Patry	730fa00e20	Adding launcher to build. (#2397 )	2024-08-12 14:08:46 +02:00
Daniël de Kok	01a515dea2	nix: add router to the devshell (#2396 )	2024-08-12 09:28:38 +02:00
Daniël de Kok	6e127dcc96	flake: use rust-overlay (#2390 )	2024-08-09 15:24:21 +02:00
Daniël de Kok	977534bcb8	flake: add fmt and clippy (#2389 )	2024-08-09 14:56:20 +02:00
Daniël de Kok	c6d5039cd7	Add experimental flake (#2384 ) Add flake.nix	2024-08-09 12:32:37 +02:00

43 Commits