hf_text-generation-inference/server/text_generation_server/models
Latest commit: Saving some VRAM. (#2790), by Nicolas Patry, b57f370386, 2024-12-03 04:04:21 +01:00
* Saving some VRAM.

- 8B on 4xL4 with attention=flashdecoding: 4.28GB left before, 4.32GB left
  after, so 400MB saved (see the measurement sketch after this commit note).

- The effect is not as visible with attention=flashinfer and n_shard=1; I
  suspect it's linked to the torch allocator.

* Adding assertion.
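The before/after figures above are free-VRAM readings taken around the model warmup/allocation step. As a rough illustration only (not code from this PR), here is a minimal sketch of how such a measurement can be taken with torch.cuda.mem_get_info; warmup_model is a hypothetical placeholder for whatever loads the weights and allocates the KV cache.

```python
import torch

def free_vram_gib(device: int = 0) -> float:
    # Driver-level free memory on the device, in GiB.
    # Memory cached by PyTorch's allocator still counts as "used" here,
    # which can hide small savings (cf. the torch-allocator remark above).
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1024**3

before = free_vram_gib()
# warmup_model()  # hypothetical placeholder: load weights / allocate the KV cache
torch.cuda.synchronize()
after = free_vram_gib()
print(f"free before: {before:.2f} GiB, free after: {after:.2f} GiB, "
      f"delta: {(after - before) * 1024:+.0f} MiB")
```

Because mem_get_info reports what the CUDA driver sees as free, caching by the torch allocator can mask part of the effect, which matches the commit's observation for attention=flashinfer with n_shard=1.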
custom_modeling
__init__.py Use FP8 KV cache when specified by compressed-tensors (#2761) 2024-11-26 08:27:41 +01:00
bloom.py
causal_lm.py Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
flash_causal_lm.py Saving some VRAM. (#2790) 2024-12-03 04:04:21 +01:00
galactica.py
globals.py
idefics_causal_lm.py
mamba.py
metadata_kernels.py feat: add payload limit (#2726) 2024-11-21 18:20:15 +00:00
mllama_causal_lm.py
model.py
pali_gemma.py
seq2seq_lm.py
types.py
vlm_causal_lm.py