hf_text-generation-inference/server/text_generation_server/utils
Abhinav M Kulkarni c35f39cf83
Add AWQ quantization inference support (#1019)
# Add AWQ quantization inference support

Fixes
https://github.com/huggingface/text-generation-inference/issues/781

This PR (partially) adds support for AWQ quantization for inference.
More information on AWQ is available [here](https://arxiv.org/abs/2306.00978). In
general, AWQ is faster and more accurate than GPTQ, which is currently
supported by TGI.

This PR installs the 4-bit GEMM custom CUDA kernels released by the AWQ
authors (a one-line change in `requirements.txt`).
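
For context, at inference time the quantized linear layers call into this kernel. Below is a minimal sketch, assuming the `awq_inference_engine` extension built from the pinned llm-awq commit exposes a `gemm_forward_cuda(input, qweight, scales, qzeros, split_k_iters)` entry point; the class name is illustrative and the PR's actual module under `awq/quantize` may be organized differently.

```python
import torch
import awq_inference_engine  # CUDA extension from the pinned llm-awq commit (assumed name)


class WQLinearSketch(torch.nn.Module):
    """4-bit AWQ linear layer: packed int weights plus per-group scales/zeros."""

    def __init__(self, qweight, qzeros, scales, bias=None):
        super().__init__()
        self.qweight = qweight  # int32 tensor packing eight 4-bit weights per element
        self.qzeros = qzeros    # packed per-group zero points
        self.scales = scales    # fp16 per-group scales; last dim == out_features
        self.bias = bias

    def forward(self, x):
        out_shape = x.shape[:-1] + (self.scales.shape[-1],)
        # Dequantization and matmul are fused inside the custom kernel; 8 = split_k_iters.
        out = awq_inference_engine.gemm_forward_cuda(
            x.reshape(-1, x.shape[-1]), self.qweight, self.scales, self.qzeros, 8
        )
        if self.bias is not None:
            out = out + self.bias
        return out.reshape(out_shape)
```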

A quick way to test this PR is to bring up TGI as follows:

```
text-generation-server download-weights abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq

text-generation-launcher \
--huggingface-hub-cache ~/.cache/huggingface/hub/ \
--model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \
--trust-remote-code --port 8080 \
--max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
--quantize awq
```
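
Once the launcher is up, a quick smoke test against TGI's `/generate` endpoint (same port as above) could look like the sketch below; the prompt and token budget are arbitrary.

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "def fibonacci(n):",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```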

Please note:
* This PR was tested with FlashAttention v2 and vLLM.
* This PR adds support for AWQ inference, not for quantizing models.
That needs to be done outside of TGI; instructions are
[here](f084f40bd9).
* This PR only adds support for `FlashLlama` models for now.
* Multi-GPU setup has not been tested. 
* No integration tests have been added so far; I will add them later if
maintainers are interested in this change.
* This PR can be tested on any of the models released
[here](https://huggingface.co/abhinavkulkarni?sort_models=downloads#models);
see the checkpoint-inspection sketch right after this list.
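
One way to see what these AWQ checkpoints contain is to list the tensors in the downloaded safetensors shards. The path below is hypothetical (adjust it to wherever `download-weights` placed the snapshot), and the per-layer `qweight` / `qzeros` / `scales` naming is an assumption based on the released checkpoints.

```python
from pathlib import Path

from safetensors import safe_open

# Hypothetical location; point this at the snapshot directory download-weights created.
snapshot_dir = Path.home() / ".cache/huggingface/hub"

for shard in sorted(snapshot_dir.rglob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        awq_tensors = [k for k in f.keys() if k.endswith(("qweight", "qzeros", "scales"))]
        if awq_tensors:
            print(shard.name)
            for name in awq_tensors:
                print("  ", name, f.get_slice(name).get_shape())
```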

Please refer to the linked issue for benchmarks comparing
[abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq](https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq)
with
[TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ).

Please note that AWQ has released faster (and, in the case of Llama, fused)
kernels for 4-bit GEMM, currently at the tip of the `main` branch of
https://github.com/mit-han-lab/llm-awq, but this PR uses an older commit
that has been tested to work. We can switch to the latest commit later on.

## Who can review?

@OlivierDehaene OR @Narsil

---------

Co-authored-by: Abhinav Kulkarni <abhinav@concentric.ai>
2023-09-25 09:58:02 +02:00
| File | Last commit | Date |
| --- | --- | --- |
| `awq/quantize` | Add AWQ quantization inference support (#1019) | 2023-09-25 09:58:02 +02:00 |
| `gptq` | Fix `__call__` vs forward. (#993) | 2023-09-07 17:36:30 +02:00 |
| `__init__.py` | feat(server): Add native support for PEFT Lora models (#762) | 2023-08-03 17:22:45 +02:00 |
| `convert.py` | fit for baichuan models (#981) | 2023-09-08 16:51:34 +02:00 |
| `dist.py` | feat: add cuda memory fraction (#659) | 2023-07-24 11:43:58 +02:00 |
| `flash_attn.py` | feat(server): flash attention v2 (#624) | 2023-07-18 16:21:18 +02:00 |
| `hub.py` | feat(server): Adding new ignore_rule for conversion. (#485) | 2023-06-23 12:41:13 +02:00 |
| `layers.py` | Add AWQ quantization inference support (#1019) | 2023-09-25 09:58:02 +02:00 |
| `logits_process.py` | fix(server): avoid errors for very small top_p values (#544) | 2023-07-04 20:11:33 +02:00 |
| `peft.py` | feat(server): Add native support for PEFT Lora models (#762) | 2023-08-03 17:22:45 +02:00 |
| `tokens.py` | Fixing top_k tokens when k ends up < 0 (#966) | 2023-09-01 00:22:03 +02:00 |
| `watermark.py` | Fixing watermark. (#851) | 2023-08-16 07:17:26 +02:00 |
| `weights.py` | Add AWQ quantization inference support (#1019) | 2023-09-25 09:58:02 +02:00 |