Adding docs about GPTQ usage.

parent 16d0fb04ae
commit 17837b1e51

README.md (+20)

@@ -45,6 +45,7 @@ to power LLMs api-inference widgets.
 - [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router) for increased total throughput
 - Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention) on the most popular architectures
 - Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)
+- Quantization with [GPTQ](https://github.com/qwopqwop200/GPTQ-for-LLaMa): 4x less RAM usage, with the same latency
 - [Safetensors](https://github.com/huggingface/safetensors) weight loading
 - Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
 - Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see [transformers.LogitsProcessor](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.LogitsProcessor))
@@ -216,6 +217,25 @@ the kernels by using the `BUILD_EXTENSIONS=False` environment variable.
 
 Be aware that the official Docker image has them enabled by default.
 
+### Quantization with GPTQ
+
+GPTQ quantization requires running sample data through the model, so we cannot quantize
+the model on the fly.
+
+Instead, we provide a script to create a new quantized model:
+```shell
+text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
+# Add --upload-to-model-id MYUSERNAME/falcon-40b to upload to the hub directly
+```
+
+This will create a new directory with the quantized files, which you can then use with:
+```shell
+text-generation-launcher --model-id /data/falcon-40b-gptq/ --sharded true --num-shard 2 --quantize gptq
+```
+
+Use `text-generation-server quantize --help` for detailed usage and more options
+during quantization.
 
 ## Run BLOOM
 
 ### Download
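Once the quantized model is launched as above, it is served through the same HTTP API as any other model handled by `text-generation-launcher`. As a rough illustration (the address and port below are placeholders; use whatever your launcher or Docker setup exposes), a generation request could look like:

```shell
# Query the running server; 127.0.0.1:8080 is only an example endpoint.
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```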
@@ -68,6 +68,7 @@ struct Args {
 
     /// Whether you want the model to be quantized. This will use `bitsandbytes` for
     /// quantization on the fly, or `gptq`.
+    /// For `gptq`, please check `text-generation-server quantize --help` for more information
     #[clap(long, env, value_enum)]
     quantize: Option<Quantization>,
 
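Because the `quantize` field is declared with `#[clap(long, env, value_enum)]`, the value can be passed as a command-line flag or, assuming clap's default environment-variable naming for this field, through `QUANTIZE`. A minimal sketch, reusing the example output directory from the README section above:

```shell
# Launch with the GPTQ-quantized weights via the command-line flag ...
text-generation-launcher --model-id /data/falcon-40b-gptq/ --quantize gptq

# ... or, assuming clap derives the env var name from the field, via the environment.
QUANTIZE=gptq text-generation-launcher --model-id /data/falcon-40b-gptq/
```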
@@ -161,8 +161,23 @@ def quantize(
     trust_remote_code: bool = False,
     upload_to_model_id: Optional[str] = None,
     percdamp: float = 0.01,
-    act_order: bool = False,
+    act_order: bool = True,
 ):
+    """
+    `quantize` will download a non-quantized model MODEL_ID, quantize it locally using GPTQ,
+    and output the new weights into the specified OUTPUT_DIR.
+
+    This quantization depends on showing examples to the model.
+    It is impossible to be fully agnostic to them (some models require pre-prompting,
+    some do not, and some are made for English).
+    If the quantized model performs worse than expected, this is one thing to investigate.
+    This CLI doesn't aim to cover each and every use case, only a minimal subset that should
+    work most of the time; for advanced usage, please modify the script directly.
+
+    The quantization script and inference code were taken from https://github.com/qwopqwop200/GPTQ-for-LLaMa
+    and updated to be more generic, so that a wider range of models can be supported more easily.
+    """
+
     download_weights(
         model_id=model_id,
         revision=revision,
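The `percdamp` and `act_order` parameters in this signature are exposed as CLI options as well. A sketch of a fuller invocation, assuming the usual mapping of these Python parameters to flag names (the exact spellings should be confirmed with `text-generation-server quantize --help`):

```shell
# Quantize with explicit GPTQ settings; flag names below are assumed from the
# parameters shown above, not taken from the CLI help output.
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq \
    --percdamp 0.01 \
    --no-act-order \
    --upload-to-model-id MYUSERNAME/falcon-40b
```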