Adding docs about GPTQ usage.

parent 16d0fb04ae
commit 17837b1e51

README.md (+20)

@@ -45,6 +45,7 @@ to power LLMs api-inference widgets.
 - [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
 - Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
 - Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- Quantization with [GPTQ](https://github.com/qwopqwop200/GPTQ-for-LLaMa): 4x less RAM usage, with the same latency
 - [Safetensors](https://github.com/huggingface/safetensors) weight loading
 - Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
 - Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
@@ -216,6 +217,25 @@ the kernels by using the `BUILD_EXTENSIONS=False` environment variable.
 
 Be aware that the official Docker image has them enabled by default.
 
+### Quantization with GPTQ
+
+GPTQ quantization requires running sample data through the model, so we cannot quantize
+the model on the fly.
+
+Instead, we provide a script to create a new quantized model:
+```shell
+text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
+# Add --upload-to-model-id MYUSERNAME/falcon-40b to upload to the hub directly
+```
+
+This will create a new directory with the quantized files, which you can then use with:
+```shell
+text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
+```
+
+Use `text-generation-server quantize --help` for detailed usage and more options
+during quantization.
 
 ## Run BLOOM
 
 ### Download
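Once the quantized model is launched as above, it is served through the same HTTP API as any other model handled by `text-generation-launcher`. As a rough illustration (the address and port below are placeholders; use whatever your launcher or Docker setup exposes), a generation request could look like:

```shell
# Query the running server; 127.0.0.1:8080 is only an example endpoint.
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```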
@@ -68,6 +68,7 @@ struct Args {
 
     /// Whether you want the model to be quantized. This will use `bitsandbytes` for
     /// quantization on the fly, or `gptq`.
+    /// For `gptq`, please check `text-generation-server quantize --help` for more information
     #[clap(long, env, value_enum)]
     quantize: Option<Quantization>,
 
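Because the `quantize` field is declared with `#[clap(long, env, value_enum)]`, the value can be passed as a command-line flag or, assuming clap's default environment-variable naming for this field, through `QUANTIZE`. A minimal sketch, reusing the example output directory from the README section above:

```shell
# Launch with the GPTQ-quantized weights via the command-line flag ...
text-generation-launcher --model-id /data/falcon-40b-gptq/ --quantize gptq

# ... or, assuming clap derives the env var name from the field, via the environment.
QUANTIZE=gptq text-generation-launcher --model-id /data/falcon-40b-gptq/
```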
@@ -161,8 +161,23 @@ def quantize(
     trust_remote_code: bool = False,
     upload_to_model_id: Optional[str] = None,
     percdamp: float = 0.01,
-    act_order: bool = False,
+    act_order: bool = True,
 ):
+    """
+    `quantize` will download a non-quantized model MODEL_ID, quantize it locally using GPTQ,
+    and output the new weights into the specified OUTPUT_DIR.
+
+    This quantization depends on showing examples to the model.
+    It is impossible to be fully agnostic to them (some models require pre-prompting,
+    some do not, and some are made for English).
+    If the quantized model performs worse than expected, this is one thing to investigate.
+    This CLI doesn't aim to cover each and every use case, only a minimal subset that should
+    work most of the time; for advanced usage, please modify the script directly.
+
+    The quantization script and inference code were taken from https://github.com/qwopqwop200/GPTQ-for-LLaMa
+    and updated to be more generic, so that a wider range of models can be supported more easily.
+    """
+
     download_weights(
         model_id=model_id,
         revision=revision,
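The `percdamp` and `act_order` parameters in this signature are exposed as CLI options as well. A sketch of a fuller invocation, assuming the usual mapping of these Python parameters to flag names (the exact spellings should be confirmed with `text-generation-server quantize --help`):

```shell
# Quantize with explicit GPTQ settings; flag names below are assumed from the
# parameters shown above, not taken from the CLI help output.
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq \
    --percdamp 0.01 \
    --no-act-order \
    --upload-to-model-id MYUSERNAME/falcon-40b
```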