diff --git a/README.md b/README.md
index 8c8d9773..d276f05e 100644
--- a/README.md
+++ b/README.md
@@ -45,6 +45,7 @@ to power LLMs api-inference widgets.
 - [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
 - Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
 - Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- Quantization with [gptq](https://github.com/qwopqwop200/GPTQ-for-LLaMa): 4x less RAM usage with roughly the same latency
 - [Safetensors](https://github.com/huggingface/safetensors) weight loading
 - Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
 - Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
@@ -216,6 +217,25 @@ the kernels by using the `BUILD_EXTENSIONS=False` environment variable.
 
 Be aware that the official Docker image has them enabled by default.
 
+
+### Quantization with GPTQ
+
+GPTQ quantization requires running calibration data through the model, so the model
+cannot be quantized on the fly.
+
+Instead, we provide a script to create a new quantized model:
+```
+text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
+# Add --upload-to-model-id MYUSERNAME/falcon-40b to upload to the hub directly
+```
+
+This will create a new directory with the quantized files, which you can then use with:
+```
+text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
+```
+Use `text-generation-server quantize --help` for detailed usage and more options
+during quantization.
+
 ## Run BLOOM
 
 ### Download
diff --git a/launcher/src/main.rs b/launcher/src/main.rs
index 36f6f6b6..ee4fa74b 100644
--- a/launcher/src/main.rs
+++ b/launcher/src/main.rs
@@ -68,6 +68,7 @@ struct Args {
     /// Whether you want the model to be quantized. This will use `bitsandbytes` for
     /// quantization on the fly, or `gptq`.
+    /// For `gptq`, please check `text-generation-server quantize --help` for more information.
     #[clap(long, env, value_enum)]
     quantize: Option<Quantization>,
diff --git a/server/text_generation_server/cli.py b/server/text_generation_server/cli.py
index aeb1f13b..6e8d6788 100644
--- a/server/text_generation_server/cli.py
+++ b/server/text_generation_server/cli.py
@@ -161,8 +161,23 @@ def quantize(
     trust_remote_code: bool = False,
     upload_to_model_id: Optional[str] = None,
     percdamp: float = 0.01,
-    act_order: bool = False,
+    act_order: bool = True,
 ):
+    """
+    `quantize` downloads the non-quantized model MODEL_ID, quantizes it locally using GPTQ,
+    and writes the new weights to the specified OUTPUT_DIR.
+
+    Quantization depends on showing calibration examples to the model, and no single set of
+    examples suits every model (some models expect a pre-prompt, some do not, and some are
+    made for English). If the quantized model performs worse than expected, this is one
+    place to investigate.
+    This CLI does not aim to cover every use case, only a minimal subset that should work
+    most of the time; for advanced usage, please modify the script directly.
+
+    The quantization script and inference code were taken from https://github.com/qwopqwop200/GPTQ-for-LLaMa
+    and updated to be more generic, so that a wider range of models is supported more easily.
+    """
+
     download_weights(
         model_id=model_id,
         revision=revision,
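As background for this change, here is a toy NumPy sketch of the grouped 4-bit round-to-nearest quantization idea that GPTQ builds on, showing why `quantize` needs calibration data. The shapes, the group size of 64, and the helper `quantize_group` are illustrative only, not the script's internals; real GPTQ additionally updates the not-yet-quantized weights to compensate for each quantization error, which this sketch omits.
```
# Toy sketch: grouped 4-bit weight quantization and its effect on layer outputs.
import numpy as np

def quantize_group(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize one group of weights to 2**bits levels and return the dequantized values."""
    qmax = 2 ** bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax or 1.0
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 128)).astype(np.float32)   # weights of one linear layer
X = rng.normal(size=(64, 128)).astype(np.float32)  # calibration inputs to that layer

group_size = 64
W_q = np.concatenate(
    [quantize_group(W[:, i : i + group_size]) for i in range(0, W.shape[1], group_size)],
    axis=1,
)

# What matters is the error on real activations (X @ W.T vs X @ W_q.T), which is why
# the choice of calibration examples, pre-prompting, and language can affect quality.
rel_err = np.linalg.norm(X @ W.T - X @ W_q.T) / np.linalg.norm(X @ W.T)
print(f"relative output error with 4-bit weights: {rel_err:.4f}")
```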
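Once a model quantized this way is served with the `text-generation-launcher` command from the README section above, a quick smoke test could look like the sketch below. The `/generate` route and JSON payload follow the standard text-generation-inference API, while the host and port are placeholders for whatever your deployment exposes.
```
# Hypothetical smoke test against a running text-generation-inference server.
import requests

response = requests.post(
    "http://localhost:8080/generate",  # adjust host/port to your deployment
    json={
        "inputs": "What is GPTQ quantization?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```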