hf_text-generation-inference/server/text_generation_server/models
Latest commit: Saving some VRAM. (#2790), by Nicolas Patry, b57f370386, 2024-12-03 04:04:21 +01:00
* Saving some VRAM.

- 8B on 4xL4 with attention=flashdecoding: 4.28GB left before, 4.32GB left
  after, so 400MB saved (see the measurement sketch after this commit note).

- The effect is not as visible with attention=flashinfer and n_shard=1; I
  suspect it's linked to the torch allocator.

* Adding assertion.
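The before/after figures above are free-VRAM readings taken around the model warmup/allocation step. As a rough illustration only (not code from this PR), here is a minimal sketch of how such a measurement can be taken with torch.cuda.mem_get_info; warmup_model is a hypothetical placeholder for whatever loads the weights and allocates the KV cache.

```python
import torch

def free_vram_gib(device: int = 0) -> float:
    # Driver-level free memory on the device, in GiB.
    # Memory cached by PyTorch's allocator still counts as "used" here,
    # which can hide small savings (cf. the torch-allocator remark above).
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1024**3

before = free_vram_gib()
# warmup_model()  # hypothetical placeholder: load weights / allocate the KV cache
torch.cuda.synchronize()
after = free_vram_gib()
print(f"free before: {before:.2f} GiB, free after: {after:.2f} GiB, "
      f"delta: {(after - before) * 1024:+.0f} MiB")
```

Because mem_get_info reports what the CUDA driver sees as free, caching by the torch allocator can mask part of the effect, which matches the commit's observation for attention=flashinfer with n_shard=1.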
custom_modeling
__init__.py Use FP8 KV cache when specified by compressed-tensors (#2761) 2024-11-26 08:27:41 +01:00
bloom.py
causal_lm.py Sync (most) server dependencies with Nix (#2782) 2024-12-03 04:04:06 +01:00
flash_causal_lm.py Saving some VRAM. (#2790) 2024-12-03 04:04:21 +01:00
galactica.py
globals.py
idefics_causal_lm.py
mamba.py
metadata_kernels.py feat: add payload limit (#2726) 2024-11-21 18:20:15 +00:00
mllama_causal_lm.py
model.py
pali_gemma.py
seq2seq_lm.py
types.py
vlm_causal_lm.py