* adding max_token_capacity_metric
* added tgi to name of metric
* Adding max capacity metric.
* Add description for the metrics
---------
Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1
* Update the locks.
* Increasing docker timeout.
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).
* Fixing the builds ?
* Fix the gh action?
* Fixing the location ?
* Validation is odd.
* Try a faster runner
* Upgrade python version.
* Remove sccache
* No sccache.
* Getting libpython maybe ?
* List stuff.
* Monkey it up.
* have no idea at this point
* Tmp.
* Shot in the dark.
* Tmate the hell out of this.
* Desperation.
* WTF.
* -y.
* Apparently 3.10 is not available anymore.
* Updating the dockerfile to make libpython discoverable at runtime too.
* Put back rust tests.
* Why do we want mkl on AMD ?
* Forcing 3.11 ?
* Adding prefix test.
* [WIP] tmp dump of integration load tests.
* Remove other tensor creation.
* Fixed the radix tree.
Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.
* Fix parsing
* Is it really flashinfer version ?
* Remove some comments.
* Revert the max prefix hit.
* Adding numpy to diff.
* Upgraded flashinfer.
* Upgrading some stuff.
* Are we done yet ?
* Minor fixup
* Remove 1 log and put back the other.
* Add comment for why slot 0 is OK.
* Mounting on the job.
* Get me a debug branch
* Debugging CIs is fun.
* Attempt #28
* wip
* Tmate.
* Praying.
* Updating VLM causal model with updated context.
* Important line got squashed.
* Tmate again.
* Fingers crossed.
* We want only 1 run of integration tests.....
---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
The minimum batch size logic could cause prefix blocks to be
deallocated without prefill. The next allocation of the same
prefix would then use garbage blocks.
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD.
Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* OVerride the env in server tests.
* Is this enough to make it work ?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests
- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integrationt tests change (seem linked to head_size
modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is
smaller.
* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* (backend) use parking_lot crate for RwLock fairness
* (docker) let's put rust in the TRTLLM folder when building
* (docker) build ompi with SLURM support
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
This change adds support for prefix caching to the v3 router. This
is broken up from the backend support to ease reviewing.
For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`
in this case, the router will switch to `RadixAllocator`. This
allocator uses a radix trie to keep track of prefills that were
seen prior. If a new prefill is a prefix of a previously-seen
prefil, the router will send a request with `prefix_len>0`, which
can be used by the backend to decide to reuse KV blocks from the
cache, rather than recomputing them.
Even though backend support is not added in this PR, the backend
will still work with prefix caching enabled. The prefix lengths
are just ignored and not used.