Commit Graph

1049 Commits

Author SHA1 Message Date
Mohit Sharma 11d7af730b add cloning in Dockerfile 2024-10-04 17:41:02 +00:00
Mohit Sharma 862651a90d ensure lfs files are downloaded 2024-10-04 17:21:01 +00:00
Mohit Sharma fc00efb2e7 Update tuned file with bf16 tuned ops 2024-10-04 17:12:31 +00:00
Mohit Sharma 066d3b1fe8 Update tuned file 2024-10-04 17:08:28 +00:00
Mohit Sharma 78776cdd25 add tuned config 2024-10-03 11:59:14 +00:00
Mohit Sharma 50d239ba8f revert pytorch 2024-10-02 12:34:56 +00:00
Daniël de Kok d1f257ac56
Move flake back to tgi-nix `main` (#2586) 2024-09-30 11:39:41 +02:00
drbh 93a7042d7e
feat: support phi3.5 moe (#2479)
* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-30 11:15:09 +02:00
Daniël de Kok 90a1d04a2f
Add support for GPTQ-quantized MoE models using MoE Marlin (#2557)
This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
2024-09-30 11:14:32 +02:00
Mohit Sharma f9e561eced
Update ROCM libs and improvements (#2579)
* style

* update torch

* fix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env var

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error message

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-09-30 10:54:32 +02:00
Ikram Ul Haq e790cfc0e4
Update architecture.md (#2577) 2024-09-30 08:56:20 +02:00
Daniël de Kok afc7ded84f
Remove compute capability lazy cell (#2580)
Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
2024-09-30 08:48:47 +02:00
Daniël de Kok 1028996fb3
flashinfer: pass window size and dtype (#2574) 2024-09-28 18:41:41 +02:00
Daniël de Kok 5b6b74e21d
Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
2024-09-27 16:19:42 +02:00
Alvaro Bartolome 0aa66d693a
Fix build with `--features google` (#2566)
* Fix `cargo build --features google`

* Add `cargo test --features google`
2024-09-26 11:41:38 +02:00
Alvaro Bartolome 0b7df77178
Add LoRA adapters support for Gemma2 (#2567)
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
2024-09-26 10:54:08 +02:00
Nicholas Broad 7efcb5e0ed
remove LORA_ADAPTERS_PATH (#2563)
specify how to call local adapters
2024-09-25 01:20:15 +02:00
Nicolas Patry dd8691b7c5
More tensor cores. (#2558)
* More tensor cores.

* Fixing the logic.

* Gemma is modified by this.
2024-09-24 23:57:26 +02:00
Nicolas Patry c032280b17
Cleanup Vertex + Chat (#2553)
* Cleanup Vertex + Chat

* logprobs defaults to false.

* Parameters are optional

* Fix docs.

* Changing back this logprobs default.

* Fixup doc.

* Let's debug that.

* Not unstable.

* Updating Cargo ?

* Wat?

* Dummy change.

* Trying some other install.

* Trying something.

* Revert everything.

* Update Cargo lock.

* Fixing the pre-commit after rebase.
2024-09-24 23:37:17 +02:00
Nicolas Patry 75c8c54ac9
Hotfixing main. (#2562) 2024-09-24 23:00:43 +02:00
Aritra Roy Gosthipaty e6d29656b5
Adding note for private models in quick-tour document (#2548)
* chore: adding note for private models in quicktour doc

* Update docs/source/quicktour.md

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
2024-09-24 15:06:53 +02:00
Orhun Parmaksız 8024ded58f
Simplify crossterm imports (#2545) 2024-09-24 14:57:20 +02:00
Orhun Parmaksız 03263f5e88
Update the link to the Ratatui organization (#2546) 2024-09-24 14:51:48 +02:00
Daniël de Kok 3f14cd1420
Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 (#2537)
This replaces the custom layers in both models.
2024-09-24 14:27:06 +02:00
Daniël de Kok c29dc89c18
Add support for scalar FP8 weight scales (#2550)
* Add support for scalar FP8 weight scales

* Support LLM compressor FP8 checkpoints on H100

On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.

* Remove stray debug print
2024-09-24 13:57:40 +02:00
Nicolas Patry 0ff6ff60ad
Hotfixing main (#2556) 2024-09-24 11:51:14 +02:00
Nicolas Patry 74d3ce106e
Micro cleanup. (#2555) 2024-09-24 11:19:24 +02:00
Alvaro Bartolome d31a6f75cc
Remove duplicated `RUN` in `Dockerfile` (#2547) 2024-09-24 10:19:13 +02:00
OlivierDehaene 10e6f29295
chore: Add old V2 backend (#2551)
* wip

* added v2
2024-09-24 08:38:17 +02:00
Daniël de Kok 9263817c71
nix: remove unused `_server.nix` file (#2538) 2024-09-23 09:43:23 +02:00
Nicolas Patry 169178b937
Preparing for release. (#2540)
* Preparing for release.

* Upgrade version in docs.
2024-09-20 17:42:04 +02:00
OlivierDehaene 7e2d18877e
fix: wrap python basic logs in debug assertion in launcher (#2539)
* fix: wrap python basic logs in debug assertion in launcher

* use level filters instead
2024-09-20 14:59:31 +00:00
Wang, Yi f478aa77ad
hotfix: ipex fails since cuda moe kernel is not supported (#2532)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-20 10:02:55 +02:00
Daniël de Kok abd24dd385
doc: clarify that `--quantize` is not needed for pre-quantized models (#2536) 2024-09-19 22:17:15 +02:00
Daniël de Kok c103760172
Update to moe-kernels 0.3.1 (#2535)
* Update to moe-kernels 0.3.1

* Attempt to fix apt failure
2024-09-19 22:16:32 +02:00
Nicolas Patry f512021e77
Stream options. (#2533)
* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow
2024-09-19 20:50:37 +02:00
Daniël de Kok ce85efa968
Move to moe-kernels package and switch to common MoE layer (#2511)
* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.

* Make `cargo check` pass

* Update runner
2024-09-17 18:08:58 +02:00
OlivierDehaene 86984e3236
fix: metrics unbounded memory (#2528) 2024-09-17 16:01:28 +00:00
Daniël de Kok 71e4268600
nix: pure Rust check/fmt/clippy/test (#2525)
Runs the tests in a Nix build sandbox.
2024-09-17 12:14:30 +02:00
Nicolas Patry 38fcafcf96
Adding a test for FD. (#2516)
* Adding a test for FD.

* Fixing flashdecoding (empty batch doesn't work).

* Fixing the invalid popping.

* Fixing radix with block_size > 1

* Last reference.

* Use an actual hash.

* Update hash for slice.len() == 1

* Update the locks.

* Increasing docker timeout.
2024-09-16 17:00:54 +02:00
Daniël de Kok 7774655297
Add tests for Mixtral (#2520)
Disable by default because CI runners do not have enough GPUs.
2024-09-16 12:39:18 +02:00
Alex Strick van Linschoten 9cca3e0b03
Use `ratatui` not (deprecated) `tui` (#2521)
* use ratatui not archived tui

* bump ratatui all the way with options
2024-09-13 18:45:28 +02:00
Wang, Yi 3ac7df2b6d
hotfix : enable intel ipex cpu and xpu in python3.11 (#2517)
enable intel ipex cpu and xpu in python3.11

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-12 17:23:49 +02:00
drbh 628334d336
fix: pass missing revision arg for lora adapter when loading multiple… (#2510)
fix: pass missing revision arg for lora adapter when loading multiple adapters
2024-09-12 17:04:52 +02:00
Nicolas Patry d95c670ada
Add nix test. (#2513)
* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try this.

* Our runner + pure test (not written)

* Remove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.
2024-09-12 14:54:56 +02:00
Daniël de Kok 94304649f1
nix: support Python tokenizer conversion in the router (#2515)
Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.
2024-09-12 10:44:01 +02:00
Nicolas Patry 69e3be20fb
Fix truffle (#2514)
* Attempting to discard the trufflehog warning.

* Attempt to fix trufflehog.
2024-09-11 22:45:19 +02:00
Nicolas Patry dae3bf1d87
Fix tokenization yi (#2507)
* Fixing odd tokenization self modifications on the Rust side (load and
resave in Python).

* Fixing the builds ?

* Fix the gh action?

* Fixing the location ?

* Validation is odd.

* Try a faster runner

* Upgrade python version.

* Remove sccache

* No sccache.

* Getting libpython maybe ?

* List stuff.

* Monkey it up.

* have no idea at this point

* Tmp.

* Shot in the dark.

* Tmate the hell out of this.

* Desperation.

* WTF.

* -y.

* Apparently 3.10 is not available anymore.

* Updating the dockerfile to make libpython discoverable at runtime too.

* Put back rust tests.

* Why do we want mkl on AMD ?

* Forcing 3.11 ?
2024-09-11 22:41:56 +02:00
Nicolas Patry a4e3e8c608
Prefix test - Different kind of load test to trigger prefix test bugs. (#2490)
* Adding prefix test.

* [WIP] tmp dump of integration load tests.

* Remove other tensor creation.

* Fixed the radix tree.

Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.

* Fix parsing

* Is it really flashinfer version ?

* Remove some comments.

* Revert the max prefix hit.

* Adding numpy to diff.

* Upgraded flashinfer.

* Upgrading some stuff.

* Are we done yet ?

* Minor fixup

* Remove 1 log and put back the other.

* Add comment for why slot 0 is OK.

* Mounting on the job.

* Get me a debug branch

* Debugging CIs is fun.

* Attempt #28

* wip

* Tmate.

* Praying.

* Updating VLM causal model with updated context.

* Important line got squashed.

* Tmate again.

* Fingers crossed.

* We want only 1 run of integration tests.....

---------

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
2024-09-11 18:10:40 +02:00
Vallepu Vamsi Krishna eabbbbda23
Add Directory Check to Prevent Redundant Cloning in Build Process (#2486)
Update Makefile-fbgemm

Added Directory check for FBGEMM repository cloning.
2024-09-07 13:19:43 +02:00