362883f259
As per the title; reported in https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 and https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5.

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```

and

```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```
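For context, here is a minimal sketch of the kind of env-var fallback the `GPTQ_BITS`/`GPTQ_GROUPSIZE` variables above suggest: prefer quantization metadata shipped with the checkpoint, and fall back to the environment when it is missing (as with some community GPTQ exports). The helper name `get_gptq_params` and the config keys are assumptions for illustration, not the repository's actual code.

```python
import os


def get_gptq_params(config: dict) -> tuple[int, int]:
    """Resolve GPTQ bits/groupsize for a checkpoint.

    Prefers the checkpoint's own quantization config; falls back to
    the GPTQ_BITS / GPTQ_GROUPSIZE environment variables otherwise.
    """
    quant = config.get("quantization_config", {})
    # `or` falls through on missing/None values; a groupsize of -1
    # (per-column quantization) is truthy and is kept as-is.
    bits = quant.get("bits") or int(os.environ["GPTQ_BITS"])
    groupsize = quant.get("group_size") or int(os.environ["GPTQ_GROUPSIZE"])
    return bits, groupsize
```

With the launcher invocation above, a checkpoint lacking quantization metadata would resolve to `bits=4, groupsize=1` from the environment.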
- __init__.py
- bloom_modeling.py
- flash_llama_modeling.py
- flash_neox_modeling.py
- flash_rw_modeling.py
- flash_santacoder_modeling.py
- mpt_modeling.py
- neox_modeling.py
- opt_modeling.py
- t5_modeling.py