History

Abhinav M Kulkarni c35f39cf83 Add AWQ quantization inference support (#1019)

# Add AWQ quantization inference support

Fixes https://github.com/huggingface/text-generation-inference/issues/781

This PR (partially) adds support for AWQ quantization for inference. More information on AWQ [here](https://arxiv.org/abs/2306.00978). In general, AWQ is faster and more accurate than GPTQ, which is currently supported by TGI.

This PR installs 4-bit GEMM custom CUDA kernels released by the AWQ authors (in `requirements.txt`, just a one-line change).

A quick way to test this PR is to bring up TGI as follows:

```
text-generation-server download-weights abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq

text-generation-launcher \
    --huggingface-hub-cache ~/.cache/huggingface/hub/ \
    --model-id abhinavkulkarni/codellama-CodeLlama-7b-Python-hf-w4-g128-awq \
    --trust-remote-code --port 8080 \
    --max-input-length 2048 --max-total-tokens 4096 --max-batch-prefill-tokens 4096 \
    --quantize awq
```

Please note:

* This PR was tested with FlashAttention v2 and vLLM.
* This PR adds support for AWQ inference, not for quantizing the models. That needs to be done outside of TGI, instructions [here](`f084f40bd9`).
* This PR only adds support for `FlashLlama` models for now.
* Multi-GPU setup has not been tested.
* No integration tests have been added so far; I will add them later if maintainers are interested in this change.
* This PR can be tested on any of the models released [here](https://huggingface.co/abhinavkulkarni?sort_models=downloads#models).

Please refer to the linked issue for benchmarks for [abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq](https://huggingface.co/abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq) vs [TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ).

Please note, AWQ has released faster (and in the case of Llama, fused) kernels for 4-bit GEMM, currently at the top of the `main` branch at https://github.com/mit-han-lab/llm-awq, but this PR uses an older commit that has been tested to work. We can switch to the latest commit later on.

## Who can review?

@OlivierDehaene OR @Narsil

---------

Co-authored-by: Abhinav Kulkarni <abhinav@concentric.ai>

2023-09-25 09:58:02 +02:00
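
Once the launcher from the commit message is up, a quick smoke test is to send one prompt to it. The following is a minimal sketch, not part of the PR: it assumes the server is reachable at `127.0.0.1:8080` (the `--port 8080` used above) and uses TGI's `/generate` route with an `inputs` string and a `max_new_tokens` parameter. The helper names `build_generate_request` and `generate` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: the launcher was started with --port 8080 on the local machine.
BASE_URL = "http://127.0.0.1:8080"


def build_generate_request(prompt, max_new_tokens=32, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /generate route."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=32, base_url=BASE_URL, timeout=60):
    """Send the request and return the `generated_text` field of the reply."""
    request = build_generate_request(prompt, max_new_tokens, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]
```

With the server running, `generate("def fibonacci(n):")` should return a completion from the AWQ-quantized model; a connection error usually means the launcher is not listening on the assumed host and port.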
custom_kernels	feat(server): Rework model loading (#344)	2023-06-08 14:51:52 +02:00
exllama_kernels	feat: add cuda memory fraction (#659)	2023-07-24 11:43:58 +02:00
tests	Rebased #617 (#868)	2023-08-28 11:43:47 +02:00
text_generation_server	Add AWQ quantization inference support (#1019)	2023-09-25 09:58:02 +02:00
.gitignore	Version 1.0.1 (#836)	2023-08-14 11:23:11 +02:00
Makefile	fix(server): fix missing datasets in quantize	2023-07-27 14:50:45 +02:00
Makefile-flash-att	feat(server): use latest flash attention commit (#543)	2023-07-04 20:23:55 +02:00
Makefile-flash-att-v2	feat(server): flash attention v2 (#624)	2023-07-18 16:21:18 +02:00
Makefile-vllm	Backport https://github.com/vllm-project/vllm/pull/936 (#977)	2023-09-04 15:00:19 +02:00
README.md	feat(router): refactor API and add openAPI schemas (#53)	2023-02-03 12:43:37 +01:00
poetry.lock	New release. (#941)	2023-08-29 14:28:22 +02:00
pyproject.toml	New release. (#941)	2023-08-29 14:28:22 +02:00
requirements.txt	Add AWQ quantization inference support (#1019)	2023-09-25 09:58:02 +02:00

README.md

Text Generation Inference Python gRPC Server

A Python gRPC server for Text Generation Inference

Install

make install

Run

make run-dev