* Add support for repacking AWQ weights for GPTQ-Marlin
So far we couldn't support AWQ because virtually all AWQ models use
symmetric quantization, which GPTQ-Marlin did not suppors. GPTQ-Marlin
has recently added support AWQ repacking and AWQ asymmetric quantization
(zero_point=True).
This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.
* Enable Marlin for supported AWQ configurations by default
This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.
Use FP8 GPTQ-Marlin kernels to enable FP8 support on CUDA GPUs
with compute capability >=8.0 and <8.9.
Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com>
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:
* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.
Fixes#2098.
Add support for GPTQ Marlin kernels
GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:
- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false
Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.
The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.