hf_text-generation-inference/server/text_generation_server/models
Daniël de Kok e903770897
Support different image sizes in prefill in VLMs (#2065)
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated, which can fail when the images have different sizes.

Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-06-17 10:49:41 +02:00
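
A minimal sketch (not the actual TGI code in vlm_causal_lm.py) of why batching the preprocessing matters, assuming a recent transformers release; the llava-hf/llava-v1.6-mistral-7b-hf checkpoint is used purely as an example:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

# Two images with different sizes/aspect ratios in the same batch.
images = [
    Image.new("RGB", (500, 500)),
    Image.new("RGB", (1500, 500)),
]

# Old approach: process each image separately, then concatenate.
# For LLaVA-NeXT the per-image tensors can contain a different number
# of patches, so torch.cat can raise a size-mismatch RuntimeError.
per_image = [processor(img, return_tensors="pt")["pixel_values"] for img in images]
try:
    torch.cat(per_image, dim=0)
except RuntimeError as err:
    print("concatenating per-image tensors failed:", err)

# New approach: hand the processor the whole batch, so it can pad all
# images to a compatible shape and return a single well-formed tensor.
batch = processor(images, return_tensors="pt")
print(batch["pixel_values"].shape)  # (2, max_num_patches, 3, H, W)
```

Processing the batch in one call also keeps any per-image metadata the processor emits (e.g. image_sizes for LLaVA-NeXT) aligned with the padded tensor.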
custom_modeling Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
__init__.py Update the link for qwen2 (#2068) 2024-06-14 11:59:33 +02:00
bloom.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
causal_lm.py server: use chunked inputs 2024-06-07 08:09:04 +02:00
flash_causal_lm.py ROCm and sliding windows fixes (#2033) 2024-06-10 15:09:50 +08:00
flash_cohere.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_dbrx.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_gemma.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_gpt2.py Add support for Marlin-quantized models 2024-06-06 13:16:52 +02:00
flash_llama.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_mistral.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_mixtral.py MLPSpeculator. (#1865) 2024-05-14 12:33:18 +02:00
flash_neox.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_phi.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_qwen2.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_rw.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_santacoder.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
flash_starcoder2.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
galactica.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
globals.py Purely refactors paged/attention into `layers/attention` and make hardware differences more obvious with 1 file per hardware. (#1986) 2024-05-31 17:57:01 +02:00
gpt_neox.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
idefics.py MLPSpeculator. (#1865) 2024-05-14 12:33:18 +02:00
idefics2.py MLPSpeculator. (#1865) 2024-05-14 12:33:18 +02:00
idefics_causal_lm.py server: use chunked inputs 2024-06-07 08:09:04 +02:00
llava_next.py MLPSpeculator. (#1865) 2024-05-14 12:33:18 +02:00
mamba.py server: use chunked inputs 2024-06-07 08:09:04 +02:00
model.py Use the generation config. (#1808) 2024-04-25 19:41:50 +02:00
mpt.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
opt.py Add support for GPTQ Marlin (#2052) 2024-06-14 09:45:42 +02:00
pali_gemma.py server: use chunked inputs 2024-06-07 08:09:04 +02:00
phi.py MLPSpeculator. (#1865) 2024-05-14 12:33:18 +02:00
rw.py fix(server): fix OPT implementation (#2061) 2024-06-12 18:22:20 +02:00
santacoder.py MLPSpeculator. (#1865) 2024-05-14 12:33:18 +02:00
seq2seq_lm.py server: use chunked inputs 2024-06-07 08:09:04 +02:00
t5.py MLPSpeculator. (#1865) 2024-05-14 12:33:18 +02:00
types.py chore: add pre-commit (#1569) 2024-02-16 11:58:58 +01:00
vlm_causal_lm.py Support different image sizes in prefill in VLMs (#2065) 2024-06-17 10:49:41 +02:00