Adding docs about GPTQ usage.
parent 16d0fb04ae
commit 17837b1e51

README.md

@@ -45,6 +45,7 @@ to power LLMs api-inference widgets.
- [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
- Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
- Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
- Quantization with [gptq](https://github.com/qwopqwop200/GPTQ-for-LLaMa): 4x less RAM usage with the same latency
- [Safetensors](https://github.com/huggingface/safetensors) weight loading
- Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
- Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))

@@ -216,6 +217,25 @@ the kernels by using the `BUILD_EXTENSIONS=False` environment variable.

Be aware that the official Docker image has them enabled by default.

### Quantization with GPTQ

GPTQ quantization requires passing sample data through the model, so the model cannot be quantized on the fly.

Instead, we provide a script to create a new quantized model:
```
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
# Add --upload-to-model-id MYUSERNAME/falcon-40b to upload to the hub directly
```

This will create a new directory with the quantized files, which you can then use with:
```
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
```

Use `text-generation-server quantize --help` for detailed usage and more options during quantization.
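
Below is a hedged sketch of such a run with extra options spelled out. The long flag names (`--trust-remote-code`, `--percdamp`, `--act-order`, `--upload-to-model-id`) are assumed here from the Python parameters of `quantize` shown later in this diff; confirm the exact spellings with `--help`.
```
# Sketch only: flag names assumed from the `quantize` signature further down; verify with --help
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq \
    --trust-remote-code \
    --percdamp 0.01 \
    --act-order \
    --upload-to-model-id MYUSERNAME/falcon-40b
```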

## Run BLOOM

### Download

@@ -68,6 +68,7 @@ struct Args {
    /// Whether you want the model to be quantized. This will use `bitsandbytes` for
    /// quantization on the fly, or `gptq`.
    /// For `gptq`, please check `text-generation-server quantize --help` for more information.
    #[clap(long, env, value_enum)]
    quantize: Option<Quantization>,
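
As a hedged illustration of this flag (not part of the commit): the `gptq` invocation mirrors the README snippet above, while the `bitsandbytes` value and the example model id are assumptions to be checked against `text-generation-launcher --help`.
```
# Illustration only: the example model id is hypothetical; `--quantize bitsandbytes` assumed from the value enum
text-generation-launcher --model-id bigscience/bloom-560m --quantize bitsandbytes

# GPTQ: point --model-id at a directory produced by `text-generation-server quantize`
text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
```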

@@ -161,8 +161,23 @@ def quantize(
    trust_remote_code: bool = False,
    upload_to_model_id: Optional[str] = None,
    percdamp: float = 0.01,
    act_order: bool = True,
):
"""
|
||||
`quantize` will download a non quantized model MODEL_ID, quantize it locally using gptq
|
||||
and output the new weights into the specified OUTPUT_DIR.
|
||||
|
||||
This quantization does depend on showing examples to the model.
|
||||
It is impossible to be fully agnostic to it (some models require preprompting,
|
||||
some not, some are made for English).
|
||||
If the quantized model is performing poorer than expected this is one way to investigate.
|
||||
This CLI doesn't aim to enable all and every use cases, but the minimal subset that should
|
||||
work most of the time, for advanced usage, please modify the script directly.
|
||||
|
||||
The quantization script and inference code was taken from https://github.com/qwopqwop200/GPTQ-for-LLaMa
|
||||
and updated to be more generic to support more easily a wider range of models.
|
||||
"""

    download_weights(
        model_id=model_id,
        revision=revision,