a785000842
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization because:

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight quantizers.
- Exclusions from quantization are configurable.

This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.

| Name |
|---|
| basic_tutorials |
| conceptual |
| reference |
| _toctree.yml |
| architecture.md |
| index.md |
| installation.md |
| installation_amd.md |
| installation_gaudi.md |
| installation_inferentia.md |
| installation_intel.md |
| installation_nvidia.md |
| quicktour.md |
| supported_models.md |
| usage_statistics.md |
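
The three properties listed in the commit message (per-target quantizers, input/output quantizers alongside weight quantizers, and exclusions) can be illustrated with a small sketch. The dictionary below follows the general shape of a compressed-tensors `quantization_config`, but its values are made up for this example. The helpers `_matches` and `find_quant_group` are hypothetical and only approximate the layer matching that this change delegates to the `compressed-tensors` package. The `re:` prefix for regex targets is likewise treated as an assumption here.

```python
import re

# Illustrative configuration only: the field values are assumptions for this
# example, not taken from a real checkpoint.
QUANT_CONFIG = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        # W8A8 FP: weights and input activations are both quantized.
        "group_0": {
            "targets": [r"re:.*self_attn\..*_proj"],
            "weights": {"num_bits": 8, "type": "float", "symmetric": True},
            "input_activations": {"num_bits": 8, "type": "float", "symmetric": True},
            "output_activations": None,
        },
        # W4A16 INT: only the weights are quantized.
        "group_1": {
            "targets": [r"re:.*mlp\..*_proj"],
            "weights": {"num_bits": 4, "type": "int", "symmetric": True},
            "input_activations": None,
            "output_activations": None,
        },
    },
    # Layers listed here are left unquantized.
    "ignore": ["lm_head"],
}


def _matches(pattern, layer_name, layer_type):
    """Hypothetical matcher: 're:' patterns are regexes on the layer name;
    anything else must equal the layer name or the layer type."""
    if pattern.startswith("re:"):
        return re.match(pattern[3:], layer_name) is not None
    return pattern in (layer_name, layer_type)


def find_quant_group(config, layer_name, layer_type):
    """Return the first config group targeting the layer, or None if the
    layer is excluded or no group applies."""
    if any(_matches(p, layer_name, layer_type) for p in config.get("ignore", [])):
        return None
    for group in config["config_groups"].values():
        if any(_matches(p, layer_name, layer_type) for p in group["targets"]):
            return group
    return None
```

With this sketch, an attention projection resolves to the W8A8 group, an MLP projection to the weight-only group, and `lm_head` to no quantization at all. Checking the `ignore` list before the groups is a design choice of the sketch so that exclusions always win over a broad target.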