Quantized weights were loaded in the `Weights` class, but this was
getting quite unwieldy, where every higher level method to load weights
was a long conditional to cover all the different quantizers.
This change moves loading of quantized weights out of the `Weights`
class. This is done by defining a simple `WeightsLoader` interface
that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
and `MarlinWeightsLoader`. These implementations are in the quantizers'
respective modules. The `Weights` class provides the low-level load
operations (such as loading tensors or sharded tensors), but delegates
loads that need quantizer-specific weight processing to a loader. The
loaders still use the low-level functionality provided by `Weights`.
I initially tried making a hierarchy where a class like `GPTQWeights`
would inherit from `Weights`. But it is not very flexible (e.g. does
not work well with the new weight storage mock used in tests) and
the implicit indirections made the code harder to follow.
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.
For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand symmetric quantization since the beginning
and incorrectly reporting the model to be symmetric will use
GPTQ-Marlin (which does not support asymmetric quantization).
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:
* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.
Fixes#2098.
Add support for GPTQ Marlin kernels
GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:
- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false
Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.
The kernels were contributed by Neural Magic to VLLM. We vendor them
here for convenience.
This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel, which provides good speedups decoding batches
of 16-32 tokens. It supports quantized models with symmetric
quantization, groupsize -1 or 128, and 4-bit.
Tested with:
- Llama 2
- Llama 3
- Phi 3