Commit Graph

1133 Commits

Author SHA1 Message Date
Daniël de Kok 7f54b7336a
Test Marlin MoE with `desc_act=true` (#2622)
Update the Mixtral GPTQ test to use a model with `desc_act=true` and
`group_size!=-1` to ensure that we are checking activation
sorting/non-full K (with tensor parallelism). The `desc_act=false` case
is already checked by the Mixtral AWQ test.
2024-10-21 12:50:35 +02:00
Daniël de Kok 5e0fb46821
Make handling of FP8 scales more consisent (#2666)
Change `fp8_quantize` so that we can pass around reciprocals everywhere,
so scales are always passed around in the checkpoint format.

I also noticed that we ignore any input scales that we might have when
fbgemm is available. Skip this path if we already have a scale.
2024-10-19 09:05:01 +02:00
Nicolas Patry 153ff3740b
CI job. Gpt awq 4 (#2665)
* add gptq and awq int4 support in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix ci failure

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* set kv cache dtype

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* refine the code according to the review command

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Simplifying conditionals + reverting integration tests values.

* Unused import

* Fix redundant import.

* Revert change after rebase.

* Upgrading the tests (TP>1 fix changes to use different kernels.)

* Update server/text_generation_server/layers/gptq/__init__.py

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-18 17:55:53 +02:00
Daniël de Kok 8ec57558cd
Break cycle between the attention implementations and KV cache (#2627) 2024-10-17 14:54:22 +02:00
drbh 5f32dea1e2
fix: prefer inplace softmax to avoid copy (#2661)
* fix: prefer inplace softmax to avoid copy

* Update server/text_generation_server/models/flash_causal_lm.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-17 08:49:02 -04:00
oOraph 1b97e084bf
fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process (#2663)
tgi-entrypoint: exec instead of spawning a child process

reason: otherwise parent will receive the signals when we'd like tgi to receive them
keeping the parent/child is not necessary and would require the parent to handle signals to forward them properly to the child

Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com>
Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>
2024-10-17 11:15:26 +02:00
Daniël de Kok 59ea38cbca
Simplify the `attention` function (#2609)
* Simplify the `attention` function

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).

* Fixup flashinfer support
2024-10-17 10:42:52 +02:00
Daniël de Kok 5bbe1ce028
Support `e4m3fn` KV cache (#2655)
* Support `e4m3fn` KV cache

* Make check more obvious
2024-10-17 10:42:16 +02:00
OlivierDehaene a6a0c97ed9
feat: prefill chunking (#2600)
* wip

* rollback

* refactor to use prefix/postfix namming + fix all_input_ids_tensor

* maybe patching vlms?

* fix filter and concat

* wip, no filter, no concat

* current

* add prepare_for_prefill

* working

* load tested

* re-create slots

* re-create slots

* fix slot_filtering_indices

* feedback loop

* remove log

* fix benchmarker

* fix vlm and seq2seq

* rename to cache and input lengths

* fix prefill logprobs

* fix launcher

* fix logprobs?

* idk at this point

* max input length

* omfg

* remove debugging lines

* fix tests

* fix mllama

* fix cargo tests

* remove support chunking for paged

* Fixing non blocked attentions

* Fixing dtype + AMD, Ipex targets.

* lint fix.

* rename

* Fix prefix_caching variable, remove defaults in server (confusing a lot
of the times).

* Add simple resolution when user specifies ATTENTION=paged.

* Put back non default simple tests.

* Fix env name

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-16 12:49:33 +02:00
Mohit Sharma 704a58c807
Fp8 e4m3_fnuz support for rocm (#2588)
* (feat) fp8 fnuz support for rocm

* (review comments) Fix compression_config load, type hints

* (bug) update all has_tensor

* (review_comments) fix typo and added comments

* (nit) improved comment
2024-10-16 09:54:50 +02:00
Alvaro Bartolome ffe05ccd05
Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` (#2651)
As spotted by @philschmid, the payload was compliant with Vertex AI, but
just partially, since ideally the most compliant version would be with
the generation kwargs flattened to be on the same level as the
`messages`; meaning that Vertex AI would still expect a list of
instances, but each instance would be an OpenAI-compatible instance,
which is more clear; and more aligned with the SageMaker integration
too, so kudos to him for spotting that; and sorry from my end for any
inconvenience @Narsil.
2024-10-15 18:11:59 +02:00
Daniël de Kok ce7e356561 Use flashinfer for Gemma 2. 2024-10-15 13:49:32 +00:00
Nicolas Patry cf04a43fb1
Fixing linters. (#2650) 2024-10-15 12:43:49 +02:00
Dmitry Rogozhkin 58848cb471
feat: enable pytorch xpu support for non-attention models (#2561)
XPU backend is available natively (without IPEX) in pytorch starting
from pytorch 2.4. This commit extends TGI to cover the case when user
has XPU support thru pytorch 2.4, but does not have IPEX installed.
Models which don't require attention can work. For attention required
models more work is needed to provide attention implementation.

Tested with the following models:
* teknium/OpenHermes-2.5-Mistral-7B
* bigscience/bloom-560m
* google/gemma-7b
* google/flan-t5-xxl

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2024-10-14 18:28:49 +02:00
Wang, Yi 7a82ddcbd0
update ipex to fix incorrect output of mllama in cpu (#2640)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-14 16:32:33 +02:00
Omar Sanseviero 51f5401893
Clarify gated description and quicktour (#2631)
Update quicktour.md
2024-10-14 16:31:37 +02:00
Nicolas Patry 3ea82d008c
Cpu perf (#2596)
* break when there's nothing to read

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Different approach, only listen on stdin when `LOG_LEVEL=debug` (which
is where dropping to a debugger is important).

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-14 15:34:08 +02:00
Omar Sanseviero ce28ee88d5
Small fixes for supported models (#2471)
* Small improvements for docs

* Update _toctree.yml

* Updating the doc (we keep the list actually).

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-14 15:26:39 +02:00
Nicolas Patry 0c478846c5
Fixing intel Supports windowing. (#2637) 2024-10-11 21:47:03 +02:00
Nicolas Patry 3dbdf63ec5
Intel ci (#2630)
* Intel CI ?

* Let's try non sharded gemma.

* Snapshot rename

* Apparently container can be gone already.
2024-10-10 16:51:57 +02:00
vb d912f0bf55
Update documentation to most recent stable version of TGI. (#2625)
Update to most recent stable version of TGI.
2024-10-10 16:00:25 +02:00
drbh e36dfaa8de
feat: allow tool calling to respond without a tool (#2614)
* feat: process token stream before returning to client

* fix: expect content in test

* fix: improve comparison via ruff lint

* fix: return event in all cases

* fix: always send event on error, avoid unwraps, refactor and improve tests

* fix: prefer no_tool over notify_error to improve reponse

* fix: adjust chat input test for no_tool

* fix: adjust test expected content

---------

Co-authored-by: System administrator <root@ip-10-90-0-186.ec2.internal>
2024-10-10 09:28:25 -04:00
Nicolas Patry 43f39f6894
AMD CI (#2589)
* Only run 1 valid test.

* TRying the tailscale action quickly.

* ?

* bash spaces.

* Remove tailscale.

* More quotes.

* mnt2 ?

* Othername to avoid recursive directories.

* Good old tmate.

* Remove tmate.

* Trying a few things.

* Remove some stuff.

* Sleep ?

* Tmp

* busybox

* Launcher tgi

* Starting hello

* Busybox in python

* No device.

* Removing all variables ?

* A un moment donné.

* Tmp

* Tmp2

* DEvice request, no container name

* No device requests

* Without pytest.

* No pytest.

* from env

* Start with devices

* Attemp #1

* Remove stdin messing

* Only 1 test, no container name

* Raw tgi

* Sending args.

* Show pip freeze.

* Start downloading with token

* Giving HIP devices.

* Mount volume + port forward

* Without pytest.

* No token

* Repeated arguments

* Wrong kwarg.

* On 2 GPUs

* Fallback to single shard CI test.

* Testing

* yaml

* Common cache ?

* Trailing slash ?

* Docker volume split.

* Fix docker volume

* Fixing ?

* ?

* Try no devices ?

* Flash llama on intel CPU ?

* Fix nvidia ?

* Temp deactivate intel, activate nvidia ?
2024-10-09 17:50:49 +02:00
Daniël de Kok 9ed0c85fe1
nix: add black and isort to the closure (#2619)
To make sure that everything is formatted with the same black version
as CI.

I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.
2024-10-09 11:08:02 +02:00
drbh 8ad20daf33
CI (2599): Update ToolType input schema (#2601)
* Update ToolType input schema

* lint

* fix: run formatter

* fix: allow tool choide to be null

---------

Co-authored-by: Wauplin <lucainp@gmail.com>
2024-10-08 12:35:48 -04:00
Daniël de Kok 6db3bcb700
nix: move back to the tgi-nix main branch (#2620) 2024-10-08 12:55:05 +02:00
Daniël de Kok 64142489b6
Add support for fused MoE Marlin for AWQ (#2616)
* Add support for fused MoE Marlin for AWQ

This uses the updated MoE Marlin kernels from vLLM.

* Add integration test for AWQ MoE
2024-10-08 11:56:41 +02:00
Nicolas Patry 8b295aa498
Upgrade minor rust version (Fixes rust build compilation cache) (#2617)
* Upgrade minor rust version (Fixes rust build compilation cache)

* Black
2024-10-08 09:42:50 +02:00
Wang, Yi 57f9685dc3
enable mllama in intel platform (#2610)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-07 21:15:09 +02:00
Florian Zimmermeister 0da4df4b96
Fix FP8 KV-cache condition (#2611)
Update kv_cache.py
2024-10-07 09:34:19 +02:00
Daniël de Kok 2358c2bb54
Add basic FP8 KV cache support (#2603)
* Add basic FP8 KV cache support

This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.

* Fix Cargo.toml
2024-10-04 17:51:48 +02:00
Daniël de Kok 68103079f4
nix: example of local package overrides during development (#2607) 2024-10-04 16:52:42 +02:00
drbh 3011639ff7
Revert "Unroll notify error into generate response" (#2605)
Revert "Unroll notify error into generate response (#2597)"

This reverts commit d22b0c1fbe.
2024-10-03 17:56:40 -04:00
Nicolas Patry f6e2f05b16
New release 2.3.1 (#2604)
* New release 2.3.1

* Update doc number
2024-10-03 14:43:49 +02:00
drbh d22b0c1fbe
Unroll notify error into generate response (#2597)
* feat: unroll notify_error if no tool is choosen

* fix: expect simple message when no tool is selected

* fix: improve test to avoid notify_error

* fix: improve docs and indicate change in expected response

* fix: adjust linting in test file
2024-10-02 11:34:57 -04:00
drbh 2335459556
CI (2592): Allow LoRA adapter revision in server launcher (#2602)
allow revision for lora adapters from launcher

Co-authored-by: Sida <sida@kulamind.com>
Co-authored-by: teamclouday <teamclouday@gmail.com>
2024-10-02 10:51:04 -04:00
Nicolas Patry 0204946d26
Max token capacity metric (#2595)
* adding max_token_capacity_metric

* added tgi to name of metric

* Adding max capacity metric.

* Add description for the metrics

---------

Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
2024-10-02 16:32:36 +02:00
Nicolas Patry d18ed5cfc5
Mllama flash version (#2585)
* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Ugrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integrations tests for mllama (cutting to 10 tokens because there seems'
to be instability after (meaning size of the batch matters.

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0
2024-10-02 11:22:13 +02:00
Daniël de Kok 584b4d7a68
nix: experimental support for building a Docker container (#2470)
* nix: experimental support for building a Docker image

Run using something like:

```
docker run \
  --device nvidia.com/gpu=all \
  -it --rm -p 8080:80 \
  -v $PWD/data:/data \
  -v $PWD/tmp:/tmp \
  tgi-docker:latest \
  --model-id <model_id>
```

* Example of building the Docker image using Nix inside Docker

* Stream to make the builder image smaller

This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.

* Don't spam journalctl on Linux

* Other dockerfile.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-10-01 18:02:06 +02:00
Daniël de Kok 1c84a30fe6
MoE Marlin: support `desc_act` for `groupsize != -1` (#2590)
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
2024-09-30 19:40:25 +02:00
Daniël de Kok d1f257ac56
Move flake back to tgi-nix `main` (#2586) 2024-09-30 11:39:41 +02:00
drbh 93a7042d7e
feat: support phi3.5 moe (#2479)
* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-30 11:15:09 +02:00
Daniël de Kok 90a1d04a2f
Add support for GPTQ-quantized MoE models using MoE Marlin (#2557)
This change add support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
2024-09-30 11:14:32 +02:00
Mohit Sharma f9e561eced
Update ROCM libs and improvements (#2579)
* style

* update torch

* ix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env vart

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error messag

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-09-30 10:54:32 +02:00
Ikram Ul Haq e790cfc0e4
Update architecture.md (#2577) 2024-09-30 08:56:20 +02:00
Daniël de Kok afc7ded84f
Remove compute capability lazy cell (#2580)
Remove compute capability lock

We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
2024-09-30 08:48:47 +02:00
Daniël de Kok 1028996fb3
flashinfer: pass window size and dtype (#2574) 2024-09-28 18:41:41 +02:00
Daniël de Kok 5b6b74e21d
Improve support for GPUs with capability < 8 (#2575)
* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s
2024-09-27 16:19:42 +02:00
Alvaro Bartolome 0aa66d693a
Fix build with `--features google` (#2566)
* Fix `cargo build --features google`

* Add `cargo test --features google`
2024-09-26 11:41:38 +02:00
Alvaro Bartolome 0b7df77178
Add LoRA adapters support for Gemma2 (#2567)
* Add LoRA adapters support for Gemma2

* Make `black` formatting happy
2024-09-26 10:54:08 +02:00