Mohit Sharma
8cc2febdb6
(fix) quantize=fp8
2024-09-30 12:07:38 +00:00
Mohit Sharma
8ee9823d3b
(feat) fp8 fnuz support for rocm
2024-09-30 11:43:45 +00:00
Mohit Sharma
2401fdc889
cleaned dockerfile
2024-09-30 03:40:00 +00:00
Mohit Sharma
3b28cf9067
improve dockerfile
2024-09-28 15:54:45 +00:00
Mohit Sharma
7cb49f6f4f
float16 dep
2024-09-27 15:53:44 +00:00
Mohit Sharma
b2cd1b66ed
fix imports after rebase
2024-09-27 15:52:43 +00:00
Mohit Sharma
473d9a892d
Merge remote-tracking branch 'upstream/main' into rocm_6.2_updates
2024-09-27 15:36:12 +00:00
Daniël de Kok
5b6b74e21d
Improve support for GPUs with capability < 8 ( #2575 )
...
* Improve support for GPUs with capability < 8
- For models that cannot use flashinfer, use flash-attn v1 + paged
attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
2024-09-27 16:19:42 +02:00
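For context, the fallback this commit describes amounts to a capability check at backend-selection time. Below is a minimal sketch of that dispatch, assuming PyTorch's CUDA API; the function name and return convention are illustrative inventions, not TGI's actual code.
```python
# Minimal sketch of the dispatch described above; names are assumptions,
# not TGI's actual API.
import torch

def select_attention_backend() -> tuple[str, bool]:
    """Return (backend, prefix_caching_enabled) for the current GPU."""
    major, _minor = torch.cuda.get_device_capability()
    if major >= 8:
        # Ampere and newer can use flashinfer with prefix caching enabled.
        return "flashinfer", True
    # Older GPUs fall back to flash-attn v1 + paged attention. v1 cannot
    # read the KV cache through block tables, so the raw key/value tensors
    # are passed instead and prefix caching is disabled.
    return "flash-attn-v1", False
```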
Mohit Sharma
346dfe398a
remove import
2024-09-27 12:59:35 +00:00
Mohit Sharma
a24c2cc5e9
updated default value
2024-09-27 12:39:12 +00:00
Mohit Sharma
ac2dccd174
improved error message
2024-09-27 12:34:04 +00:00
Mohit Sharma
816d4b67b2
fix import
2024-09-27 12:32:17 +00:00
Mohit Sharma
47c81d2924
Merge remote-tracking branch 'upstream/main' into fix_rocm_fa
2024-09-27 10:34:16 +00:00
Mohit Sharma
829144d15a
addressed review comments
2024-09-27 10:28:37 +00:00
Alvaro Bartolome
0aa66d693a
Fix build with `--features google` ( #2566 )
...
* Fix `cargo build --features google`
* Add `cargo test --features google`
2024-09-26 11:41:38 +02:00
Alvaro Bartolome
0b7df77178
Add LoRA adapters support for Gemma2 ( #2567 )
...
* Add LoRA adapters support for Gemma2
* Make `black` formatting happy
2024-09-26 10:54:08 +02:00
Nicholas Broad
7efcb5e0ed
remove LORA_ADAPTERS_PATH ( #2563 )
...
specify how to call local adapters
2024-09-25 01:20:15 +02:00
Nicolas Patry
dd8691b7c5
More tensor cores. ( #2558 )
...
* More tensor cores.
* Fixing the logic.
* Gemma is modified by this.
2024-09-24 23:57:26 +02:00
Nicolas Patry
c032280b17
Cleanup Vertex + Chat ( #2553 )
...
* Cleanup Vertex + Chat
* logprobs defaults to false.
* Parameters are optional
* Fix docs.
* Changing back this logprobs default.
* Fixup doc.
* Let's debug that.
* Not unstable.
* Updating Cargo?
* What?
* Dummy change.
* Trying some other install.
* Trying something.
* Revert everything.
* Update Cargo lock.
* Fixing the pre-commit after rebase.
2024-09-24 23:37:17 +02:00
Nicolas Patry
75c8c54ac9
Hotfixing main. ( #2562 )
2024-09-24 23:00:43 +02:00
Aritra Roy Gosthipaty
e6d29656b5
Adding note for private models in quick-tour document ( #2548 )
...
* chore: adding note for private models in quicktour doc
* Update docs/source/quicktour.md
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>
* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>
---------
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
2024-09-24 15:06:53 +02:00
Orhun Parmaksız
8024ded58f
Simplify crossterm imports ( #2545 )
2024-09-24 14:57:20 +02:00
Orhun Parmaksız
03263f5e88
Update the link to the Ratatui organization ( #2546 )
2024-09-24 14:51:48 +02:00
Daniël de Kok
3f14cd1420
Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 ( #2537 )
...
This replaces the custom layers in both models.
2024-09-24 14:27:06 +02:00
Daniël de Kok
c29dc89c18
Add support for scalar FP8 weight scales ( #2550 )
...
* Add support for scalar FP8 weight scales
* Support LLM Compressor FP8 checkpoints on H100
On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM Compressor. This change adds enough parsing to detect whether models have
FP8-quantized weights.
* Remove stray debug print
2024-09-24 13:57:40 +02:00
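For context on what "scalar FP8 weight scales" means in practice: checkpoints quantized with LLM Compressor may store `weight_scale` as a 0-dim scalar (one scale for the whole tensor) rather than one scale per output channel. A rough sketch of normalizing such a scale follows; the helper name is hypothetical, not code from this PR.
```python
import torch

def normalize_fp8_scale(weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Broadcast a scalar FP8 weight scale to per-output-channel shape.

    Hypothetical helper: kernels that expect one scale per output row
    need a 0-dim scalar scale expanded before use.
    """
    if scale.numel() == 1:
        scale = scale.reshape(1).expand(weight.shape[0]).contiguous()
    return scale
```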
Mohit Sharma
64e981fdcf
fix issue for sliding window models
2024-09-24 10:53:19 +00:00
Nicolas Patry
0ff6ff60ad
Hotfixing main ( #2556 )
2024-09-24 11:51:14 +02:00
Nicolas Patry
74d3ce106e
Micro cleanup. ( #2555 )
2024-09-24 11:19:24 +02:00
Alvaro Bartolome
d31a6f75cc
Remove duplicated `RUN` in `Dockerfile` ( #2547 )
2024-09-24 10:19:13 +02:00
OlivierDehaene
10e6f29295
chore: Add old V2 backend ( #2551 )
...
* wip
* added v2
2024-09-24 08:38:17 +02:00
Daniël de Kok
9263817c71
nix: remove unused `_server.nix` file ( #2538 )
2024-09-23 09:43:23 +02:00
Nicolas Patry
169178b937
Preparing for release. ( #2540 )
...
* Preparing for release.
* Upgrade version in docs.
2024-09-20 17:42:04 +02:00
OlivierDehaene
7e2d18877e
fix: wrap python basic logs in debug assertion in launcher ( #2539 )
...
* fix: wrap python basic logs in debug assertion in launcher
* use level filters instead
2024-09-20 14:59:31 +00:00
Mohit Sharma
21d1b0cd8b
fix conflict
2024-09-20 08:59:17 +00:00
Wang, Yi
f478aa77ad
hotfix: ipex fails since cuda moe kernel is not supported ( #2532 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-20 10:02:55 +02:00
Daniël de Kok
abd24dd385
doc: clarify that `--quantize` is not needed for pre-quantized models ( #2536 )
2024-09-19 22:17:15 +02:00
Daniël de Kok
c103760172
Update to moe-kernels 0.3.1 ( #2535 )
...
* Update to moe-kernels 0.3.1
* Attempt to fix apt failure
2024-09-19 22:16:32 +02:00
Nicolas Patry
f512021e77
Stream options. ( #2533 )
...
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
2024-09-19 20:50:37 +02:00
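The feature follows the OpenAI `stream_options` schema: usage statistics are only included in the stream when explicitly requested. A hedged usage example against a local server; the endpoint URL and model name below are assumptions, not values taken from this PR.
```python
# Example request with stream_options; URL and model name are assumptions
# about a locally running server.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
        # Per the OpenAI schema, usage is only sent when asked for:
        "stream_options": {"include_usage": True},
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode())
```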
Mohit Sharma
4fb947d2aa
fixed style
2024-09-19 14:28:21 +00:00
Mohit Sharma
e6d07a6d34
euff
2024-09-18 12:03:52 +00:00
Daniël de Kok
ce85efa968
Move to moe-kernels package and switch to common MoE layer ( #2511 )
...
* Move to moe-kernels package and switch to common MoE layer
This change introduces the new `moe-kernels` package:
- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
models.
- Port over Mixtral and Deepseek.
* Make `cargo check` pass
* Update runner
2024-09-17 18:08:58 +02:00
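The `SparseMoELayer` API itself isn't shown in this log, so the sketch below is a generic top-k sparse MoE forward pass illustrating the computation such a layer performs; every name in it is an assumption, not TGI's interface.
```python
import torch
import torch.nn.functional as F

def sparse_moe_forward(x, gate_weight, experts, top_k=2):
    """Generic top-k sparse MoE forward pass (illustration only).

    x: (tokens, hidden); gate_weight: (n_experts, hidden);
    experts: list of per-expert feed-forward callables.
    """
    logits = x @ gate_weight.t()                      # (tokens, n_experts)
    weights, idx = torch.topk(logits, top_k, dim=-1)  # route to top-k experts
    weights = F.softmax(weights, dim=-1, dtype=x.dtype)
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, k] == e
            if mask.any():
                out[mask] += weights[mask, k, None] * expert(x[mask])
    return out
```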
OlivierDehaene
86984e3236
fix: metrics unbounded memory ( #2528 )
2024-09-17 16:01:28 +00:00
Daniël de Kok
71e4268600
nix: pure Rust check/fmt/clippy/test ( #2525 )
...
Runs the tests in a Nix build sandbox.
2024-09-17 12:14:30 +02:00
Nicolas Patry
38fcafcf96
Adding a test for FD. ( #2516 )
...
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1
* Update the locks.
* Increasing docker timeout.
2024-09-16 17:00:54 +02:00
Daniël de Kok
7774655297
Add tests for Mixtral ( #2520 )
...
Disable by default because CI runners do not have enough GPUs.
2024-09-16 12:39:18 +02:00
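One conventional way to keep such tests available while skipping them on under-provisioned runners is an opt-in environment variable; the gate below is a sketch of that pattern, with the variable name assumed rather than taken from the PR.
```python
import os
import pytest

# Sketch of an opt-in gate for GPU-heavy tests; the variable name is assumed.
requires_multi_gpu = pytest.mark.skipif(
    os.environ.get("RUN_MIXTRAL_TESTS") != "1",
    reason="CI runners do not have enough GPUs",
)

@requires_multi_gpu
def test_mixtral_generation():
    ...
```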
Alex Strick van Linschoten
9cca3e0b03
Use `ratatui` not (deprecated) `tui` ( #2521 )
...
* use ratatui not archived tui
* bump ratatui all the way with options
2024-09-13 18:45:28 +02:00
Mohit Sharma
4ba9210f91
fix docker
2024-09-12 15:45:06 +00:00
Wang, Yi
3ac7df2b6d
hotfix: enable intel ipex cpu and xpu in python3.11 ( #2517 )
...
enable intel ipex cpu and xpu in python3.11
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-09-12 17:23:49 +02:00
drbh
628334d336
fix: pass missing revision arg for lora adapter when loading multiple… ( #2510 )
...
fix: pass missing revision arg for lora adapter when loading multiple adapters
2024-09-12 17:04:52 +02:00
Mohit Sharma
59fd0cbdff
add skinny kernel and merge fixes
2024-09-12 13:16:13 +00:00