Adding docs about GPTQ usage.
parent 16d0fb04ae
commit 17837b1e51

README.md

@@ -45,6 +45,7 @@ to power LLMs api-inference widgets.
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- Quantization with [gptq](https://github.com/qwopqwop200/GPTQ-for-LLaMa): 4x less RAM usage with the same latency
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))

@@ -216,6 +217,25 @@ the kernels by using the `BUILD_EXTENSIONS=False` environment variable.

Be aware that the official Docker image has them enabled by default.

### Quantization with GPTQ

GPTQ quantization requires passing sample data through the model, so the model cannot be quantized on the fly.

Instead, we provide a script to create a new quantized model:
```
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
# Add --upload-to-model-id MYUSERNAME/falcon-40b to upload to the hub directly
```

This will create a new directory with the quantized files, which you can then use with:
```
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
```

Use `text-generation-server quantize --help` for detailed usage and more options during quantization.
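
Below is a hedged sketch of such a run with extra options spelled out. The long flag names (`--trust-remote-code`, `--percdamp`, `--act-order`, `--upload-to-model-id`) are assumed here from the Python parameters of `quantize` shown later in this diff; confirm the exact spellings with `--help`.
```
# Sketch only: flag names assumed from the `quantize` signature further down; verify with --help
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq \
    --trust-remote-code \
    --percdamp 0.01 \
    --act-order \
    --upload-to-model-id MYUSERNAME/falcon-40b
```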

## Run BLOOM

### Download

@@ -68,6 +68,7 @@ struct Args {
    /// Whether you want the model to be quantized. This will use `bitsandbytes` for
    /// quantization on the fly, or `gptq`.
    /// For `gptq`, please check `text-generation-server quantize --help` for more information.
    #[clap(long, env, value_enum)]
    quantize: Option<Quantization>,
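
As a hedged illustration of this flag (not part of the commit): the `gptq` invocation mirrors the README snippet above, while the `bitsandbytes` value and the example model id are assumptions to be checked against `text-generation-launcher --help`.
```
# Illustration only: the example model id is hypothetical; `--quantize bitsandbytes` assumed from the value enum
text-generation-launcher --model-id bigscience/bloom-560m --quantize bitsandbytes

# GPTQ: point --model-id at a directory produced by `text-generation-server quantize`
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
```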

@@ -161,8 +161,23 @@ def quantize(
    trust_remote_code: bool = False,
    upload_to_model_id: Optional[str] = None,
    percdamp: float = 0.01,
    act_order: bool = True,
):
"""
|
||||
`quantize` will download a non quantized model MODEL_ID, quantize it locally using gptq
|
||||
and output the new weights into the specified OUTPUT_DIR.
|
||||
|
||||
This quantization does depend on showing examples to the model.
|
||||
It is impossible to be fully agnostic to it (some models require preprompting,
|
||||
some not, some are made for English).
|
||||
If the quantized model is performing poorer than expected this is one way to investigate.
|
||||
This CLI doesn't aim to enable all and every use cases, but the minimal subset that should
|
||||
work most of the time, for advanced usage, please modify the script directly.
|
||||
|
||||
The quantization script and inference code was taken from https://github.com/qwopqwop200/GPTQ-for-LLaMa
|
||||
and updated to be more generic to support more easily a wider range of models.
|
||||
"""

    download_weights(
        model_id=model_id,
        revision=revision,