hf_text-generation-inference/server/text_generation_server/layers
Daniël de Kok 5e0fb46821
Make handling of FP8 scales more consistent (#2666)
Change `fp8_quantize` so that we can pass reciprocals around everywhere; this way, scales are
always passed around in the checkpoint format.

I also noticed that we ignore any input scales that we might have when
fbgemm is available. Skip this path if we already have a scale.
2024-10-19 09:05:01 +02:00
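A minimal sketch of the scale convention described in the commit message, assuming per-tensor `float8_e4m3fn` quantization; `fp8_quantize_sketch` is a hypothetical stand-in, not the actual `fp8_quantize` in `fp8.py`. It always returns the scale in checkpoint (dequantization) format, and it reuses an input scale instead of deriving a new one, mirroring the "skip this path if we already have a scale" point:

```python
from typing import Optional

import torch

FP8_DTYPE = torch.float8_e4m3fn  # assumed FP8 format for this sketch


def fp8_quantize_sketch(
    weight: torch.Tensor, scale: Optional[torch.Tensor] = None
):
    """Quantize `weight` to FP8; the returned scale is always in checkpoint
    format, i.e. the dequantization scale (reciprocal of the multiplier)."""
    finfo = torch.finfo(FP8_DTYPE)
    if scale is not None:
        # A scale is already available (checkpoint format): reuse it instead
        # of re-deriving one; its reciprocal is the quantization multiplier.
        multiplier = scale.to(torch.float32).reciprocal()
    else:
        # Derive a per-tensor multiplier from the weight's dynamic range.
        amax = weight.abs().amax().to(torch.float32).clamp(min=1e-12)
        multiplier = finfo.max / amax
        scale = multiplier.reciprocal()
    qweight = (
        (weight.to(torch.float32) * multiplier)
        .clamp(finfo.min, finfo.max)
        .to(FP8_DTYPE)
    )
    # Callers always receive the dequantization (checkpoint-format) scale.
    return qweight, scale


# Usage: quantize, then dequantize with the returned checkpoint-format scale.
w = torch.randn(128, 256, dtype=torch.float16)
qw, s = fp8_quantize_sketch(w)
w_approx = qw.to(torch.float32) * s
```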
| Name | Last commit | Date |
| --- | --- | --- |
| `attention` | Break cycle between the attention implementations and KV cache (#2627) | 2024-10-17 14:54:22 +02:00 |
| `awq` | CI job. Gpt awq 4 (#2665) | 2024-10-18 17:55:53 +02:00 |
| `gptq` | CI job. Gpt awq 4 (#2665) | 2024-10-18 17:55:53 +02:00 |
| `marlin` | | |
| `moe` | | |
| `__init__.py` | | |
| `bnb.py` | | |
| `conv.py` | | |
| `eetq.py` | | |
| `exl2.py` | | |
| `fp8.py` | Make handling of FP8 scales more consistent (#2666) | 2024-10-19 09:05:01 +02:00 |
| `layernorm.py` | | |
| `linear.py` | | |
| `lora.py` | | |
| `medusa.py` | | |
| `mlp.py` | | |
| `rotary.py` | | |
| `speculative.py` | | |
| `tensor_parallel.py` | | |