* Fix cargo-chef prepare
In the prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly.
If Cargo.lock is not present, cargo-chef generates a new one first, which
can differ from build to build and invalidate Docker build caches.
* Fix Dockerfile_amd and Dockerfile_intel
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host ?
* Moving buildx install after tailscale ?
* 1.79
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
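
A minimal sketch of the packed sharding idea, assuming a 1-D tensor made of
equally sized blocks (e.g. Q, K and V biases) concatenated along one axis.
The helper name `shard_packed` and the use of plain lists instead of tensors
are illustrative assumptions, not the actual `Weights.get_packed_sharded`
implementation:

```python
def shard_packed(values, num_blocks, rank, world_size):
    """Return this rank's slice of every block in a packed 1-D sequence.

    Hypothetical helper: `values` holds `num_blocks` equally sized blocks
    back to back. Each rank takes the same relative slice from every block,
    so the result is itself a (smaller) packed sequence.
    """
    total = len(values)
    if total % num_blocks != 0:
        raise ValueError("packed length must be divisible by num_blocks")
    block_size = total // num_blocks
    if block_size % world_size != 0:
        raise ValueError("block size must be divisible by world_size")
    shard_size = block_size // world_size

    shard = []
    for block in range(num_blocks):
        # Offset of this rank's slice inside the current block.
        start = block * block_size + rank * shard_size
        shard.extend(values[start:start + shard_size])
    return shard


packed = list(range(12))  # three blocks of four values each
rank0 = shard_packed(packed, num_blocks=3, rank=0, world_size=2)
rank1 = shard_packed(packed, num_blocks=3, rank=1, world_size=2)
```

Slicing each block separately, rather than splitting the packed tensor in
half, keeps the Q, K and V parts aligned on every rank.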
* Set maximum grpc message receive size to 2GiB
The previous default was 4MiB, which doesn't really work well for
multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
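
For reference, the gRPC receive limit above can be sketched as a channel
argument. `grpc.max_receive_message_length` is a standard gRPC channel
argument; using the largest signed 32-bit value for "2GiB" and the constant
names are assumptions of this sketch, not necessarily what the change does:

```python
# Sizes in bytes.
MIB = 1024 * 1024
PREVIOUS_DEFAULT = 4 * MIB  # 4MiB, the default mentioned above

# Assumption: gRPC integer channel arguments are 32-bit signed, so the
# largest usable value is one byte short of a full 2GiB.
MAX_RECEIVE_MESSAGE_LENGTH = (1 << 31) - 1

# Options in the form accepted by the `options` parameter of
# `grpc.server(...)` / `grpc.aio.server(...)`.
server_options = [
    ("grpc.max_receive_message_length", MAX_RECEIVE_MESSAGE_LENGTH),
]
```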
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056). Images were processed separately and
then concatenated, which fails when the resulting tensors differ in shape.
Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
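
A toy illustration of why batch-level preprocessing helps, assuming the
processor pads to a common size (a real image processor may resize or tile
instead). Images are nested lists here and `pad_batch` is a hypothetical
helper:

```python
def pad_batch(images, fill=0):
    """Pad every 2-D image (list of rows) to the largest height and width.

    Hypothetical stand-in for an image processor that sees the whole batch
    at once and can therefore pick one target shape for all images.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)

    padded = []
    for img in images:
        # Pad each row on the right, then add empty rows at the bottom.
        rows = [list(row) + [fill] * (max_w - len(row)) for row in img]
        rows += [[fill] * max_w for _ in range(max_h - len(rows))]
        padded.append(rows)
    return padded


batch = pad_batch([[[1, 2], [3, 4]], [[5, 6, 7]]])
shapes = {(len(img), len(img[0])) for img in batch}
```

Processing the images one by one would yield a 2x2 and a 1x3 result that
cannot be stacked; processing them together yields a single shape.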
Add support for GPTQ Marlin kernels
GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:
- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false
Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.
The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
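
The supported configurations listed above can be expressed as a simple
check. The function name `can_use_gptq_marlin` and its signature are
assumptions made for this sketch:

```python
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


def can_use_gptq_marlin(bits: int, groupsize: int, desc_act: bool) -> bool:
    """Return True if a GPTQ configuration falls in the supported set.

    Hypothetical helper: only encodes the list above (bits, groupsize and
    a boolean desc_act); hardware requirements are out of scope here.
    """
    return (
        bits in SUPPORTED_BITS
        and groupsize in SUPPORTED_GROUPSIZES
        and isinstance(desc_act, bool)
    )
```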
* feat: add kserve feature and basic routes
* feat: implement infer endpoint wrapper around generate
* fix: refactor and improve types
* fix: improve infer and simplify
* fix: cleanup and improve api docs
* fix: refactor and encapsulate kserve feat in file
* fix: remove typos after rebase
Add support for Phi-3-medium
The main difference between the medium and mini models is that medium
uses grouped query attention with a packed QKV matrix. This change adds
support for GQA with packed matrices to `Weights.get_weights_col_packed`
and uses it for Phi-3. This also allows us to remove the custom
implementation of GQA from dbrx attention loading.
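
A rough sketch of sharding a packed QKV matrix under GQA, where the Q block
is larger than the K and V blocks. Rows are modelled as list items and
`shard_packed_gqa` is a hypothetical helper, not the code in
`Weights.get_weights_col_packed`:

```python
def shard_packed_gqa(rows, num_heads, num_kv_heads, head_size, rank, world_size):
    """Slice this rank's part out of each of the Q, K and V blocks.

    Assumes the packed layout [Q | K | V] along the row axis, with
    num_heads * head_size rows for Q and num_kv_heads * head_size rows
    for each of K and V.
    """
    q_size = num_heads * head_size
    kv_size = num_kv_heads * head_size
    if len(rows) != q_size + 2 * kv_size:
        raise ValueError("unexpected packed QKV size")
    if num_heads % world_size or num_kv_heads % world_size:
        raise ValueError("head counts must be divisible by world_size")

    shard = []
    offset = 0
    for block_size in (q_size, kv_size, kv_size):
        # Each block is split evenly, but the blocks differ in size.
        shard_size = block_size // world_size
        start = offset + rank * shard_size
        shard.extend(rows[start:start + shard_size])
        offset += block_size
    return shard


# 4 query heads, 2 KV heads, head_size 1: Q = rows 0-3, K = 4-5, V = 6-7.
gqa_rank0 = shard_packed_gqa(list(range(8)), 4, 2, 1, rank=0, world_size=2)
gqa_rank1 = shard_packed_gqa(list(range(8)), 4, 2, 1, rank=1, world_size=2)
```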
* update vllm commit & fix models using sliding window
* update
* update commit
* fix bug where tunableop is bound to CUDA graphs even when CUDA graphs are disabled
* enable tunableop by default
* fix sliding window
* address review
* dead code
* precise comment
* is it flaky?
The router will now send the input as chunks in addition to a single
string. This change modifies the server to process the chunked input
rather than the string. This also allows us to remove the image
extraction code from the server.
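
A small sketch of what consuming chunked input can look like. The
`TextChunk` and `ImageChunk` types, the `<image>` placeholder and
`split_chunks` are all hypothetical names for illustration and do not
mirror the actual protobuf messages:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class TextChunk:
    text: str


@dataclass
class ImageChunk:
    data: bytes
    mimetype: str


Chunk = Union[TextChunk, ImageChunk]


def split_chunks(
    chunks: List[Chunk], image_token: str = "<image>"
) -> Tuple[str, List[ImageChunk]]:
    """Build the text prompt and collect images from typed chunks.

    Because every chunk already carries its type, the server does not need
    to parse images back out of a single string.
    """
    parts: List[str] = []
    images: List[ImageChunk] = []
    for chunk in chunks:
        if isinstance(chunk, TextChunk):
            parts.append(chunk.text)
        elif isinstance(chunk, ImageChunk):
            parts.append(image_token)
            images.append(chunk)
        else:
            raise TypeError(f"unsupported chunk type: {type(chunk).__name__}")
    return "".join(parts), images


image = ImageChunk(data=b"\x89PNG", mimetype="image/png")
prompt, images = split_chunks([TextChunk("What is in "), image, TextChunk("?")])
```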