a785000842
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs. |
||
---|---|---|
.. | ||
custom_kernels | ||
exllama_kernels | ||
exllamav2_kernels | ||
tests | ||
text_generation_server | ||
.gitignore | ||
Makefile | ||
Makefile-awq | ||
Makefile-eetq | ||
Makefile-exllamav2 | ||
Makefile-flash-att | ||
Makefile-flash-att-v2 | ||
Makefile-flashinfer | ||
Makefile-lorax-punica | ||
Makefile-selective-scan | ||
Makefile-vllm | ||
README.md | ||
poetry.lock | ||
pyproject.toml | ||
requirements_cuda.txt | ||
requirements_intel.txt | ||
requirements_rocm.txt |
README.md
Text Generation Inference Python gRPC Server
A Python gRPC server for Text Generation Inference
Install
make install
Run
make run-dev