* Using flash decoding
Conditional flashdecoding.
Fix max_q.
Working kvcache
Working version with flash decoding.
Make it work for mistral.
Fix after rebase..
Less intrusive.
REvert changes in modeling.
Speedup flashdecoding.
HHachweew
Hack to make other models work.
Fixing non flash decoding llama path.
Router logic knows about page size.
Missing 2 models.
Missing cohere.
Fixing cohere flash decoding.
Revamped all this architecture.
Fix cohere.
Fixing falcon.
Enabling custom block size schedule.
Update router/src/infer.rs
Not sending preallocated output.
* Making it work on non flash decoding.
* Fix Cohere.
* Fix non decoding paths.
* Rebased.
* No need for cache_manager anymore.
* Update?
* "ipex" -> "cpu"
* These do not belong.
* Factoring cu_seqlen_qk for better abstracting over every model.
* Fixing non flash tests/imports.
* Changing return everywhere.
* Update mistral past.
* Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
* Fixup mistral clamping (had issues with cuda graphs).
* No need to recreate anything actually.
* refine get xpu free memory
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable qwen2 in xpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable gemma/gemma2/phi in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.
For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand symmetric quantization since the beginning
and incorrectly reporting the model to be symmetric will use
GPTQ-Marlin (which does not support asymmetric quantization).
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* fix: refactor post_processor logic and add test
* fix: remove dev comment
* fix: adjust when post_processor is overridden and improve create_post_processor
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes#2029.
While at it, also remove all the image token handling duplication
in `prepare_input`.
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:
* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.
Fixes#2098.
When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.
Correctly pass through the bias when it is not `None`.
Fixes#2106.
* feat: first draft load multiple lora
* feat: load weights within layer and refactor lora pass
* fix: refactor and reduce lora math
* feat: baseline impl single request multi lora support
* feat: prefer lorax implementation and port loading logic
* fix: prefer adapter_data and refactors
* feat: perfer loraxs custom punica kernels and add mlp loras
* fix: adjust batch for bgmv
* fix: adjust adapter_segments logic when in batch
* fix: refactor and move changes to v3 proto
* fix: pass model_id for all flash causal lms
* fix: pass model_id for all causal and seq2seq lms
* fix: add model_id to model test
* feat: add lora support to mistral and refactors
* feat: prefer model id in request
* fix: include rust code for adapter id
* feat: bump launcher and add new lora docs
* feat: support base model generation and refactors
* fix: rename doc to retry ci build
* feat: support if vlm models
* fix: add adapter_data param and avoid missing layers
* fix: add adapter_data param to phi and neox
* fix: update all models forwards to include adapter_data
* fix: add model_id to IdeficsCausalLM
* Update lora.md
Fixed a typo
* Update lora.md
Fixing spam image
* fix: add lora kernel to dockerfile, support running without kernels and refactors
* fix: avoid dockerfile conflict
* fix: refactors and adjust flash llama lora logic
* fix: skip llama test due to CI issue (temp)
* fix: skip llama test CI (temp) 2
* fix: revert skips and prefer updated ci token for tests
* fix: refactors and helpful comments
* fix: add noop in TensorParallelAdapterRowLinear too
* fix: refactor and move shard_lora_weights logic
* fix: exit early if no adapter_data
---------
Co-authored-by: Derek <datavistics@gmail.com>
* Add pytest release marker
Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI
* Removing IPEX_AVAIL.
Chose to unify CPU and XPU under `ipex`. Most code is exactly similar
except for a very few spots.
The biggest number of spots is the kv-cache layout and the flash_xxx.py
files.
Since those files should be removed soon and factored away, we should
not need them.
* Forgot a few places.
* Unrelated change.
* Fixing HF_TOKEN.
* HF_TOKEN
* add CPU tgi support
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* ipex distributed ops support
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
* use xpu-smi to dump used memory
xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Update server/text_generation_server/utils/import_utils.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Fix cargo-chef prepare
In prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly.
If Cargo.lock is not present, cargo-chef will generate a new one first, which
might vary a lot and invalidate docker build caches.
* Fix Dockerfile_amd and Dockerfile_intel
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host ?
* Moving buildx install after tailscale ?
* 1.79
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
* Set maximum grpc message receive size to 2GiB
The previous default was 4MiB, which doesn't really work well for
multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
When a batch contained images if different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.
Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.