* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
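As a hedged illustration of the stream-options/optional-usage commits above, the sketch below assumes an OpenAI-compatible `/v1/chat/completions` endpoint; the URL and model id are placeholders. The point is that a `usage` payload only appears when `stream_options.include_usage` is explicitly requested.

```python
# Minimal sketch: stream chat completions and opt in to usage reporting via
# stream_options. URL and model id are placeholders, not actual defaults.
import json
import requests

resp = requests.post(
    "http://localhost:3000/v1/chat/completions",  # assumed local endpoint
    json={
        "model": "tgi",  # placeholder model id
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
        # Without this, the stream should not include a usage payload.
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # When include_usage is set, the final chunk carries the token counts.
    if chunk.get("usage"):
        print(chunk["usage"])
```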
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
* Move to moe-kernels package and switch to common MoE layer
This change introduces the new `moe-kernels` package:
- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE models (a usage sketch follows after this list).
- Port over Mixtral and Deepseek.
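As a hedged illustration of what such a shared sparse-MoE layer does (routing each token to its top-k experts), here is a toy sketch; the class name, constructor arguments, and expert shape are assumptions for illustration, not the actual `SparseMoELayer` API from `moe-kernels`.

```python
# Illustrative sketch only: a common sparse-MoE layer that routes each token
# to its top-k experts. Names and shapes are assumptions, not the real API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoELayer(nn.Module):
    def __init__(self, hidden_size: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size, bias=False),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size, bias=False),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, hidden_size]
        weights = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = topk_idx[:, k] == e
                if mask.any():
                    out[mask] += topk_w[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```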
* Make `cargo check` pass
* Update runner
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1
* Update the locks.
* Increasing docker timeout.
* Add nix test.
* Modifying yourself means you need to rerun.
* Fixing the test + adding click (needed for pre-commit hooks).
* Try this.
* Our runner + pure test (not written)
* Remove server.
* Root user.
* Different user ?
* Add the actual test target.
* Forgot this modification.
* Add a formatter.
* Add the secrets.
* Fixed the auth token ?
* Adding the other tests.
* Missing pre-commit.
* Test requires cargo for cargo fmt.
* Update it a bit.
* Up.
* Attempting to use a cache location for the models.
* Ignore the cache for now.
Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).
* Fixing the builds ?
* Fix the gh action?
* Fixing the location ?
* Validation is odd.
* Try a faster runner
* Upgrade python version.
* Remove sccache
* No sccache.
* Getting libpython maybe ?
* List stuff.
* Monkey it up.
* have no idea at this point
* Tmp.
* Shot in the dark.
* Tmate the hell out of this.
* Desperation.
* WTF.
* -y.
* Apparently 3.10 is not available anymore.
* Updating the dockerfile to make libpython discoverable at runtime too.
* Put back rust tests.
* Why do we want mkl on AMD ?
* Forcing 3.11 ?
* Adding prefix test.
* [WIP] tmp dump of integration load tests.
* Remove other tensor creation.
* Fixed the radix tree.
Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.
* Fix parsing
* Is it really flashinfer version ?
* Remove some comments.
* Revert the max prefix hit.
* Adding numpy to diff.
* Upgraded flashinfer.
* Upgrading some stuff.
* Are we done yet ?
* Minor fixup
* Remove 1 log and put back the other.
* Add comment for why slot 0 is OK.
* Mounting on the job.
* Get me a debug branch
* Debugging CIs is fun.
* Attempt #28
* wip
* Tmate.
* Praying.
* Updating VLM causal model with updated context.
* Important line got squashed.
* Tmate again.
* Fingers crossed.
* We want only 1 run of integration tests.....
---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
Fix regression caused by the attention API change: ipex.varlen_attention does not support paged-cache format KV input for now (see the shape sketch below).
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
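For context on the regression above, a shape-level sketch of the two KV layouts; it only illustrates the format difference and deliberately does not call the actual `ipex.varlen_attention` API.

```python
# Shape-level illustration only (not the ipex API): varlen attention expects
# KV for all sequences packed contiguously, while a paged KV cache stores
# fixed-size blocks addressed through a block table.
import torch

num_heads, head_dim, block_size = 8, 64, 16
seq_lens = [5, 9]                      # two sequences in the batch
total_tokens = sum(seq_lens)

# Contiguous "varlen" layout: [total_tokens, num_heads, head_dim] plus
# cumulative sequence lengths to delimit each sequence.
kv_varlen = torch.randn(total_tokens, num_heads, head_dim)
cu_seqlens = torch.tensor([0, 5, 14], dtype=torch.int32)

# Paged layout: [num_blocks, block_size, num_heads, head_dim] plus a block
# table mapping each sequence to the blocks holding its tokens (-1 = unused).
num_blocks = 4
kv_paged = torch.randn(num_blocks, block_size, num_heads, head_dim)
block_table = torch.tensor([[0, -1], [1, -1]], dtype=torch.int32)

# A kernel written for the varlen layout cannot consume kv_paged directly;
# the blocks would first have to be gathered back into a contiguous tensor.
```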
The minimum batch size logic could cause prefix blocks to be
deallocated without prefill. The next allocation of the same
prefix would then use garbage blocks.
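A toy sketch of the failure mode described above, under a much-simplified allocator (this is not TGI's radix allocator): blocks recorded as a cached prefix are freed before prefill ever wrote them, so the next allocation of the same prefix gets a cache hit on garbage blocks.

```python
# Toy illustration only; not TGI's radix allocator.
class ToyPrefixCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.prefix_to_blocks = {}   # prefix tokens -> block ids
        self.block_contents = {}     # block id -> tokens actually prefilled

    def allocate(self, prefix):
        if prefix in self.prefix_to_blocks:
            blocks = self.prefix_to_blocks[prefix]
            # "Prefix hit": only safe if prefill really wrote these blocks.
            valid = all(self.block_contents.get(b) == prefix for b in blocks)
            return blocks, valid
        blocks = [self.free.pop()]
        self.prefix_to_blocks[prefix] = blocks
        return blocks, True

    def prefill(self, prefix):
        for b in self.prefix_to_blocks[prefix]:
            self.block_contents[b] = prefix

    def deallocate(self, prefix):
        # Freeing without dropping the cache entry is what makes the next
        # allocation of the same prefix unsound.
        self.free.extend(self.prefix_to_blocks[prefix])


cache = ToyPrefixCache(num_blocks=2)
blocks, _ = cache.allocate(("hello",))
cache.deallocate(("hello",))           # dropped before prefill ran
_, valid = cache.allocate(("hello",))  # same prefix: cache hit...
print(valid)                           # ...but the blocks hold garbage: False
```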
* Tied embeddings in MLP speculator.
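A minimal sketch of what tying the embeddings means here, assuming the speculator has an input embedding and an output head of matching shape; the names are illustrative, not the MLP speculator's actual code.

```python
# Minimal sketch of weight tying: the output head reuses the input embedding
# matrix instead of learning a separate one. Illustrative only.
import torch.nn as nn

class ToySpeculatorHead(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, tie_weights: bool = True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.head = nn.Linear(hidden_size, vocab_size, bias=False)
        if tie_weights:
            # Tied embeddings: one [vocab_size, hidden_size] matrix is shared
            # between the input embedding and the output head.
            self.head.weight = self.embedding.weight

    def forward(self, hidden):
        return self.head(self.proj(hidden))
```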
* Fixing the scale_weight when users decide to use less speculation than defined in the config.
* Adding scaling support + optimize some ops.
* Update the docs with the Intel CPU part.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
We never use `latest` in documentation; it causes too many issues for users. The release number gets updated on every release.
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>