Adding docs about GPTQ usage.

Nicolas Patry 2023-06-15 19:41:04 +02:00
parent 16d0fb04ae
commit 17837b1e51
3 changed files with 37 additions and 1 deletion


@@ -45,6 +45,7 @@ to power LLMs api-inference widgets.
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- Quantization with [GPTQ](https://github.com/qwopqwop200/GPTQ-for-LLaMa): 4x less RAM usage with the same latency
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
@@ -216,6 +217,25 @@ the kernels by using the `BUILD_EXTENSIONS=False` environment variable.
Be aware that the official Docker image has them enabled by default.
### Quantization with GPTQ
GPTQ quantization requires running data through the model, so the model cannot be
quantized on the fly. Instead, we provide a script to create a new quantized model:
```
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
# Add --upload-to-model-id MYUSERNAME/falcon-40b to upload to the hub directly
```
This will create a new directory with the quantized files, which you can then use with:
```
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
```
Use `text-generation-server quantize --help` for detailed usage and more options
during quantization.
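
For example, here is a hedged sketch of a run that tweaks a few of those options. The flag spellings (`--percdamp`, `--trust-remote-code`, `--upload-to-model-id`) are assumed from the `quantize` signature shown later in this commit, so double-check them against `--help` before running:
```
# Hypothetical invocation: flag names assumed from the quantize() signature,
# verify them with `text-generation-server quantize --help`.
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq \
    --percdamp 0.01 \
    --trust-remote-code \
    --upload-to-model-id MYUSERNAME/falcon-40b
```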
## Run BLOOM
### Download


@@ -68,6 +68,7 @@ struct Args {
/// Whether you want the model to be quantized. This will use `bitsandbytes` for
/// quantization on the fly, or `gptq`.
/// For `gptq`, please check `text-generation-server quantize --help` for more information.
#[clap(long, env, value_enum)]
quantize: Option<Quantization>,
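
Because the option is declared with clap's `env` attribute, it can presumably also be set through environment variables instead of CLI flags. A minimal sketch, assuming clap's default SCREAMING_SNAKE_CASE variable names (`MODEL_ID`, `NUM_SHARD`, `QUANTIZE`), which are assumptions here rather than documented behaviour:
```
# Hypothetical: environment-variable names assumed from clap's default naming;
# the CLI flags shown in the README section above are the documented path.
MODEL_ID=/data/falcon-40b-gptq NUM_SHARD=2 QUANTIZE=gptq text-generation-launcher
```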


@@ -161,8 +161,23 @@ def quantize(
trust_remote_code: bool = False,
upload_to_model_id: Optional[str] = None,
percdamp: float = 0.01,
act_order: bool = True,
):
"""
`quantize` will download a non quantized model MODEL_ID, quantize it locally using gptq
and output the new weights into the specified OUTPUT_DIR.
This quantization does depend on showing examples to the model.
It is impossible to be fully agnostic to it (some models require preprompting,
some not, some are made for English).
If the quantized model is performing poorer than expected this is one way to investigate.
This CLI doesn't aim to enable all and every use cases, but the minimal subset that should
work most of the time, for advanced usage, please modify the script directly.
The quantization script and inference code was taken from https://github.com/qwopqwop200/GPTQ-for-LLaMa
and updated to be more generic to support more easily a wider range of models.
"""
download_weights(
model_id=model_id,
revision=revision,