compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because
- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
quantizers.
- Configurable exclusions for quantization.
This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.
The following types of quantization are supported in this PR:
- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
Support for other quantization types will be added in subsequent PRs.
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.
* Update test snapshots
* Add `impureWithCuda` dev shell
This shell is handy when developing some kernels jointly with TGI - it
adds nvcc and a bunch of commonly-used CUDA libraries to the environment.
We don't add this to the normal impure shell to keep the development
environment as clean as possible (avoid accidental dependencies, etc.).
* Add cuDNN
To make sure that everything is formatted with the same black version
as CI.
I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.
* nix: experimental support for building a Docker image
Run using something like:
```
docker run \
--device nvidia.com/gpu=all \
-it --rm -p 8080:80 \
-v $PWD/data:/data \
-v $PWD/tmp:/tmp \
tgi-docker:latest \
--model-id <model_id>
```
* Example of building the Docker image using Nix inside Docker
* Stream to make the builder image smaller
This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.
* Don't spam journalctl on Linux
* Other dockerfile.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Improve support for GPUs with capability < 8
- For models that cannot use flashinfer, use flash-attn v1 + paged
attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
* Move to moe-kernels package and switch to common MoE layer
This change introduces the new `moe-kernels` package:
- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
models.
- Port over Mixtral and Deepseek.
* Make `cargo check` pass
* Update runner
Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
Updates tgi-nix input:
- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs cumpute capabilities with Torch (avoids
compiling too mana capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
* nix: pure server and support both pure and impure devShells
* nix: remove unused poetry2nix input
It is not wired up and we now have a pure server.
* nix: add ipdb to impure devshell
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.