The default package wraps the launcher and puts the server/router in the
path.
As a result, TGI can be started using something like:
```
nix run .# -- \
--model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--port 8080
```
* nix: pure server and support both pure and impure devShells
* nix: remove unused poetry2nix input
It is not wired up and we now have a pure server.
* nix: add ipdb to impure devshell
* All integration tests back everywhere (too many failed CI).
* Upgrade integration tests after 12.4
* Attempt to remove the specifed compute cap.
* Common arch list.
* Punica uses raw ASM which is not valid on 9.0 apparently.
* doc: Add metrics documentation and add a 'Reference' section
* doc: Add API reference
* doc: Refactor API reference
* fix: Message API link
* Bad rebase
* Moving the docs.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
* Fixing exl2 and other quanize tests again.
* Mark exl2 as non release (so CI tests them, needs to be removed latet).
* Fixing exl2 (by disabling cuda graphs)
* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
* (backend) use parking_lot crate for RwLock fairness
* (docker) let's put rust in the TRTLLM folder when building
* (docker) build ompi with SLURM support
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
* fix: improve completions to send a final chunk with usage details
* fix: include finish reason string
* fix: remove dev debug trait and unneeded mut
* fix: update openapi schema
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.
For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously-seen
prefil, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.
Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.
This change adds support for FlashInfer. FlashInfer can be enabled using
`FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
Since this functionality is currently only for testing, FlashInfer is
not installed anywhere yet.
The FlashInfer API is quite different from FlashAttention/vLLM in that
it requires more global bookkeeping:
* A wrapper class needs to be contstructed (which we just call *state*).
Since this is fairly expensive (due to pinned host memory allocation),
we only do this once in a FlashCausalLM instance or for each CUDA
Graph size.
* Each model forward call needs to be wrapped in `begin_forward` and
`end_forward`. This sets up data structures that can be reused for all
calls to attention for that forward call.
When calling attention, we need access to the state object. To avoid
passing an argument down the call chain (which would require changes to
all models), we use a context variable.
Each model forward call is wrapped using a context manager that does all
the bookkeeping for such a call:
* Set the context variable to the forward call's state.
* Call `begin_forward` on the state.
* Yield.
* Call `end_forward` on the state.
* Reset the context variable.
We cannot use a single shared global variable for this, since e.g. CUDA
Graphs of different sizes each have their own state.
* Fix unsigned integer underflow
Passing --max-batch-size to the launcher actually had no effect
because after a few requests the max_size passed to State::next_batch
would underflow becoming a largo positive number.
In the scheduler, as soon as the cached batch size reached the
max_batch_size the max_size passed to next_batch becomes 0.
Since the only check in that funcion is
```
if Some(batch_requests.len()) == max_size {
break;
}
```
and it's called after the `batch_requests.len()` has
become 1, it doesn't do anything to prevent more than 0
requests from being batched.
Now we have cached batch in the server that is large than
max_batch_size and `max_size - batch_size as usize`
underflows.
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
* fix: update v3 scheduler and ensure max_batch_size > 0
---------
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
* Update Quantization docs and minor doc fix.
* update readme with latest quants info
* Apply suggestions from code review
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* up
---------
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* hotfix: fix xpu crash brought by code refine. torch.xpu rely on import ipex
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* reable gemma2 in xpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix in regression in ipex flashattention
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>