compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more flexible than the earlier AWQ/GPTQ/FP8
quantization formats because:
- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
quantizers.
- Exclusions from quantization are configurable.
This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.
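As a rough illustration of per-target quantizers with exclusions, the
sketch below matches a layer name against configuration groups. The dict
layout, the `re:` prefix handling, and the `find_quantizer` helper are
assumptions made for this example; they are not the `compressed-tensors`
package's actual API.

```python
import re
from typing import Optional

# Hypothetical configuration: two groups with different quantizers and an
# exclusion list. Field names are illustrative assumptions.
CONFIG = {
    "config_groups": {
        "attention": {
            "targets": ["re:.*self_attn\\..*_proj"],
            "weights": {"type": "int", "num_bits": 4},
            "input_activations": None,
        },
        "mlp": {
            "targets": ["re:.*mlp\\..*_proj"],
            "weights": {"type": "float", "num_bits": 8},
            "input_activations": {"type": "float", "num_bits": 8},
        },
    },
    "ignore": ["lm_head"],
}


def _matches(pattern: str, name: str) -> bool:
    # Patterns starting with "re:" are treated as regular expressions;
    # anything else must equal the layer name exactly.
    if pattern.startswith("re:"):
        return re.fullmatch(pattern[3:], name) is not None
    return pattern == name


def find_quantizer(config: dict, layer_name: str) -> Optional[dict]:
    """Return the first group whose targets match, or None if unquantized."""
    if any(_matches(p, layer_name) for p in config["ignore"]):
        return None
    for group in config["config_groups"].values():
        if any(_matches(p, layer_name) for p in group["targets"]):
            return group
    return None
```

Checking the exclusion list first means an excluded layer stays
unquantized even if a target pattern would also match it.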
The following types of quantization are supported in this PR:
- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
Support for other quantization types will be added in subsequent PRs.
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ
The ipex kernel provides functionality such as add_bias, so there is no need to add it outside
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
* fix: only check model type if config exists
* fix: adjust sharding and lm head logic
* fix qwen2 failure in intel cpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: return correct shape logits and add streaming test
* fix: remove unused import and refactor test
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with message and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer)
* Remove redundancy.
* Fixing the tests.
* Flake.lock update ?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Elide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.
* Update test snapshots
* Add support for FP8 KV cache scales
Since FP8 only has limited dynamic range, we can scale keys/values
before storing them into the cache (and unscale them in attention). To
avoid rescaling the cache as the absmax values change, good scales are
usually determined per layer using calibration data and stored
in the checkpoint.
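The effect of scaling before storing can be sketched in plain Python.
`round_to_e4m3` is a simplified stand-in for a cast to `float8_e4m3fn`
(it ignores subnormals and the minimum exponent), and the `store`/`load`
helpers and the example scale are hypothetical; none of this is the
kernel code used by the server.

```python
import math

E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn


def round_to_e4m3(x: float) -> float:
    """Simplified float8_e4m3fn rounding: 4 significant bits, saturating.

    Subnormals and the minimum exponent are ignored in this sketch.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return max(-E4M3_MAX, min(E4M3_MAX, rounded))


def store(values, scale):
    # Divide by the scale before casting to FP8 for the cache.
    return [round_to_e4m3(v / scale) for v in values]


def load(cached, scale):
    # Multiply by the scale again when the cache is read in attention.
    return [c * scale for c in cached]


keys = [0.5, -1200.0, 300.0]
k_scale = 1200.0 / E4M3_MAX  # hypothetical calibrated per-layer scale

unscaled = load(store(keys, 1.0), 1.0)
scaled = load(store(keys, k_scale), k_scale)
```

Without a scale, the entry with magnitude 1200 saturates at the FP8
maximum of 448; with the scale it survives the round trip, at the cost of
coarser resolution for the small entries.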
This change adds support for using key-value scales and loading them
from checkpoints in the two most common formats:
- Separate per-layer `k_scale` and `v_scale` scalars.
- Per-layer `kv_scale` scalar (older format).
Currently, scales are only used with a `float8_e4m3fn` cache.
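A minimal sketch of resolving the two checkpoint formats follows. The
`load_kv_scales` helper, the tensor naming under `prefix`, and the
fallback to a scale of 1.0 are assumptions for illustration; plain floats
stand in for checkpoint tensors.

```python
from typing import Mapping, Tuple


def load_kv_scales(tensors: Mapping[str, float], prefix: str) -> Tuple[float, float]:
    """Return (k_scale, v_scale) for one attention layer.

    `tensors` stands in for a checkpoint; real checkpoints hold tensors,
    plain floats are used here to keep the sketch dependency-free.
    """
    k_name, v_name = f"{prefix}.k_scale", f"{prefix}.v_scale"
    if k_name in tensors and v_name in tensors:
        return tensors[k_name], tensors[v_name]

    kv_name = f"{prefix}.kv_scale"
    if kv_name in tensors:
        # Older format: one scalar shared by keys and values.
        return tensors[kv_name], tensors[kv_name]

    # No scales in the checkpoint: fall back to the identity scale.
    return 1.0, 1.0
```

Returning a pair in every case lets the caller treat both formats, and
checkpoints without scales, the same way.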
Besides adding support for key/value scales, the `fp8_quantize` function
is also extended to support quantization with a kernel vendored from
vLLM. This is slightly faster than the PyTorch implementation and also
scales in FP32, potentially improving accuracy.
* Update FP8 KV cache test to use checkpoint with scales
* `can_scale`: check that the attention is flashinfer
Change `fp8_quantize` so that we can pass around reciprocals everywhere,
so scales are always passed around in the checkpoint format.
I also noticed that we ignore any input scales that we might have when
fbgemm is available. Skip this path if we already have a scale.
* add gptq and awq int4 support in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix ci failure
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set kv cache dtype
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* refine the code according to the review comments
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Simplifying conditionals + reverting integration tests values.
* Unused import
* Fix redundant import.
* Revert change after rebase.
* Upgrading the tests (TP>1 fix changes to use different kernels.)
* Update server/text_generation_server/layers/gptq/__init__.py
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
* Simplify the `attention` function
- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
`PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).
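The kwargs-only idea can be sketched with a toy single-head attention in
plain Python. The signature below (including the `softmax_scale`
parameter) is hypothetical and much smaller than the server's real
`attention` function; it only shows how a bare `*` prevents tensor
arguments from being swapped by position.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(*, query: Matrix, key: Matrix, value: Matrix, softmax_scale: float) -> Matrix:
    """Toy single-head attention; the bare `*` makes every argument keyword-only."""
    out = []
    for q in query:
        scores = [softmax_scale * sum(a * b for a, b in zip(q, k)) for k in key]
        peak = max(scores)  # subtract the maximum for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[d] for w, v in zip(weights, value))
            for d in range(len(value[0]))
        ])
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
result = attention(query=q, key=k, value=v, softmax_scale=1.0)
```

Calling `attention(q, k, v, 1.0)` positionally raises a `TypeError`, so
a mix-up of same-typed arguments is caught at the call site.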
* Fixup flashinfer support
The XPU backend is available natively (without IPEX) in PyTorch starting
from PyTorch 2.4. This commit extends TGI to cover the case where the user
has XPU support through PyTorch 2.4 but does not have IPEX installed.
Models that don't require attention can work. For models that require
attention, more work is needed to provide an attention implementation.
Tested with the following models:
* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
* Add basic FP8 KV cache support
This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:
* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.
* Fix Cargo.toml
* Working loading state.
* Preprocessing.
* Working state ? (Broke idefics1 temporarily).
* Cleaner condition.
* Fix idefics.
* Updating config, removing TODO
* Mllama
* Upgrade transformers to 4.45
* Flashing mllama.
* Starting to get there.
* Working state.
* Integration tests for mllama (cutting to 10 tokens because there seems
to be instability after that, meaning the size of the batch matters).
* Updating model link.
* Earlier assert.
* Fix vlm ?
* remove log.
* Force ignore all images but last.
* Default dtype bfloat16.
* Update integration test after switch to bf16.
* Remove dead code.
* Removed dead code.
* Upgrade the flake to latest transformers/tokenizers
* Move to hf tgi-nix
* Upgrade to 0.5.0
* feat: support phi3.5 moe model loading
* fix: prefer llama base model and improve rotary logic
* feat: return reasonable generation and add integration test
* fix: run lint and update docs
* fix: rerun lint for openapi docs
* fix: prefer do_sample false unless temp is set by user, and update chat tests
* fix: small typo adjustments
* fix: consolidate long rope paths
* fix: revert greedy by default and test changes
* Vendor configuration so that we don't have to `trust_remote_code`
* Use SparseMoELayer
* Add support for dense MoE
* Some type annotations
* Add the usual model tests
* Ruff.
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:
- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
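The restrictions above can be read as a validation step. The sketch
below is a hypothetical check, not code from this change: the
`GPTQMoEConfig` container and `unsupported_reasons` helper are invented
for illustration, with field names modeled on common GPTQ config keys.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GPTQMoEConfig:
    # Hypothetical container; field names mirror common GPTQ config keys.
    quant_method: str  # "gptq" or "awq"
    desc_act: bool
    group_size: int  # -1 is taken here to mean "no grouping"
    sym: bool
    tensor_parallel_size: int = 1


def unsupported_reasons(cfg: GPTQMoEConfig) -> List[str]:
    """Collect the reasons a config falls outside the supported set."""
    reasons = []
    if cfg.quant_method == "awq":
        reasons.append("AWQ is not supported for MoE layers")
    if not cfg.sym:
        reasons.append("asymmetric quantization is not supported")
    if cfg.desc_act and cfg.tensor_parallel_size > 1 and cfg.group_size != -1:
        reasons.append("desc_act with tensor parallelism requires group_size=-1")
    return reasons
```

Collecting all reasons, rather than failing on the first, gives a user
one complete message about why a checkpoint cannot be loaded.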
* Improve support for GPUs with capability < 8
- For models that cannot use flashinfer, use flash-attn v1 + paged
attention on GPUs with a compute capability lower than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
cache, since v1 cannot use block tables.
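The selection logic described in these bullets might look roughly like
the sketch below. The `plan_attention` helper and the backend labels are
hypothetical, and two branches are assumptions not stated above: that
prefix caching stays enabled with flashinfer, and that GPUs with
capability 8 or newer without flashinfer also fall back to paged
attention.

```python
from typing import NamedTuple


class AttentionPlan(NamedTuple):
    backend: str
    prefix_caching: bool


def plan_attention(capability_major: int, can_use_flashinfer: bool) -> AttentionPlan:
    if can_use_flashinfer:
        # Assumption: prefix caching remains available with flashinfer.
        return AttentionPlan("flashinfer", True)
    if capability_major < 8:
        # flash-attn v1 cannot use block tables, so prefill is given the
        # key/value tensors rather than the paged cache.
        return AttentionPlan("flash-attn-v1+paged", False)
    # Assumption: newer GPUs without flashinfer also use paged attention,
    # which disables prefix caching.
    return AttentionPlan("paged", False)
```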
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
* Add support for scalar FP8 weight scales
* Support LLM compressor FP8 checkpoints on H100
On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.
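A minimal sketch of such detection, assuming a quantization config with
`config_groups` whose `weights` entries carry `type` and `num_bits`
fields. The `has_fp8_weights` helper is hypothetical and the field names
are assumptions about the checkpoint layout, not a confirmed description
of the parsing added here.

```python
def has_fp8_weights(quantization_config: dict) -> bool:
    """Return True if any config group declares 8-bit float weights."""
    groups = quantization_config.get("config_groups", {})
    for group in groups.values():
        weights = group.get("weights") or {}
        if weights.get("type") == "float" and weights.get("num_bits") == 8:
            return True
    return False


# Hypothetical config fragments for illustration.
fp8_config = {"config_groups": {"group_0": {"weights": {"type": "float", "num_bits": 8}}}}
int4_config = {"config_groups": {"group_0": {"weights": {"type": "int", "num_bits": 4}}}}
```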
* Remove stray debug print
* Move to moe-kernels package and switch to common MoE layer
This change introduces the new `moe-kernels` package:
- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
models.
- Port over Mixtral and Deepseek.
* Make `cargo check` pass
* Update runner