8511669cb2
Quantized weights were loaded in the `Weights` class, but this was getting quite unwieldy, where every higher level method to load weights was a long conditional to cover all the different quantizers. This change moves loading of quantized weights out of the `Weights` class. This is done by defining a simple `WeightsLoader` interface that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`, and `MarlinWeightsLoader`. These implementations are in the quantizers' respective modules. The `Weights` class provides the low-level load operations (such as loading tensors or sharded tensors), but delegates loads that need quantizer-specific weight processing to a loader. The loaders still use the low-level functionality provided by `Weights`. I initially tried making a hierarchy where a class like `GPTQWeights` would inherit from `Weights`. But it is not very flexible (e.g. does not work well with the new weight storage mock used in tests) and the implicit indirections made the code harder to follow. |
||
---|---|---|
.. | ||
custom_kernels | ||
exllama_kernels | ||
exllamav2_kernels | ||
marlin | ||
tests | ||
text_generation_server | ||
.gitignore | ||
Makefile | ||
Makefile-awq | ||
Makefile-eetq | ||
Makefile-flash-att | ||
Makefile-flash-att-v2 | ||
Makefile-lorax-punica | ||
Makefile-selective-scan | ||
Makefile-vllm | ||
README.md | ||
poetry.lock | ||
pyproject.toml | ||
requirements_cuda.txt | ||
requirements_intel.txt | ||
requirements_rocm.txt |
README.md
Text Generation Inference Python gRPC Server
A Python gRPC server for Text Generation Inference
Install
make install
Run
make run-dev