Update the Mixtral GPTQ test to use a model with `desc_act=true` and
`group_size!=-1` to ensure that we are checking activation
sorting/non-full K (with tensor parallelism). The `desc_act=false` case
is already checked by the Mixtral AWQ test.
* add gptq and awq int4 support in intel platform
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix ci failure
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* set kv cache dtype
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* refine the code according to the review command
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Simplifying conditionals + reverting integration tests values.
* Unused import
* Fix redundant import.
* Revert change after rebase.
* Upgrading the tests (TP>1 fix changes to use different kernels.)
* Update server/text_generation_server/layers/gptq/__init__.py
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: process token stream before returning to client
* fix: expect content in test
* fix: improve comparison via ruff lint
* fix: return event in all cases
* fix: always send event on error, avoid unwraps, refactor and improve tests
* fix: prefer no_tool over notify_error to improve reponse
* fix: adjust chat input test for no_tool
* fix: adjust test expected content
---------
Co-authored-by: System administrator <root@ip-10-90-0-186.ec2.internal>
To make sure that everything is formatted with the same black version
as CI.
I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.
* Add basic FP8 KV cache support
This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:
* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.
* Fix Cargo.toml
* feat: unroll notify_error if no tool is choosen
* fix: expect simple message when no tool is selected
* fix: improve test to avoid notify_error
* fix: improve docs and indicate change in expected response
* fix: adjust linting in test file
* Working loading state.
* Preprocessing.
* Working state ? (Broke idefics1 temporarily).
* Cleaner condition.
* Fix idefics.
* Updating config, removing TODO
* Mllama
* Ugrade transformers 4.45
* Flashing mllama.
* Starting to get there.
* Working state.
* Integrations tests for mllama (cutting to 10 tokens because there seems'
to be instability after (meaning size of the batch matters.
* Updating model link.
* Earlier assert.
* Fix vlm ?
* remove log.
* Force ignore all images but last.
* Default dtype bfloat16.
* Update integration test after switch to bf16.
* Remove dead code.
* Removed dead code.
* Upgrade the flake to latest transformers/tokenizers
* Move to hf tgi-nix
* Upgrade to 0.5.0
* feat: support phi3.5 moe model loading
* fix: prefer llama base model and improve rotary logic
* feat: return reasonable generation and add integration test
* fix: run lint and update docs
* fix: rerun lint for openapi docs
* fix: prefer do_sample false unless temp is set by user, and update chat tests
* fix: small typo adjustments
* fix: consolidate long rope paths
* fix: revert greedy by default and test changes
* Vendor configuration so that we don't have to `trust_remote_code`
* Use SparseMoELayer
* Add support for dense MoE
* Some type annotations
* Add the usual model tests
* Ruff.
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This change add support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:
- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
* Move to moe-kernels package and switch to common MoE layer
This change introduces the new `moe-kernels` package:
- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
models.
- Port over Mixtral and Deepseek.
* Make `cargo check` pass
* Update runner
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1
* Update the locks.
* Increasing docker timeout.
* Adding prefix test.
* [WIP] tmp dump of integration load tests.
* Remove other tensor creation.
* Fixed the radix tree.
Used a slice everywhere in radix.rs to keep the cheap Arc cloning
instead of recomputing the input_ids.
* Fix parsing
* Is it really flashinfer version ?
* Remove some comments.
* Revert the max prefix hit.
* Adding numpy to diff.
* Upgraded flashinfer.
* Upgrading some stuff.
* Are we done yet ?
* Minor fixup
* Remove 1 log and put back the other.
* Add comment for why slot 0 is OK.
* Mounting on the job.
* Get me a debug branch
* Debugging CIs is fun.
* Attempt #28
* wip
* Tmate.
* Praying.
* Updating VLM causal model with updated context.
* Important line got squashed.
* Tmate again.
* Fingers crossed.
* We want only 1 run of integration tests.....
---------
Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD.
Remove paged as a default too, and using FD everywhere.
* Update cargo lock ?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Updated flake lock
* Tmp
* Upgrade resolution system for less errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* OVerride the env in server tests.
* Is this enough to make it work ?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat
input and not (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops this doesn't belong here.
* Put back default pure shell.
* Update server tests
- Default to throughput test in k6
- Use TGI_WIGGLE_ROOM to adjust wiggle room
* Only n_heads / process_group.size() are necessary.
* Revert the integrationt tests change (seem linked to head_size
modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is
smaller.
* Apply suggestions from code review
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update server/text_generation_server/layers/attention/common.py
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Fix disabling prefix caching - Fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* All integration tests back everywhere (too many failed CI).
* Upgrade integration tests after 12.4
* Attempt to remove the specifed compute cap.
* Common arch list.
* Punica uses raw ASM which is not valid on 9.0 apparently.
* Fixing exl2 and other quanize tests again.
* Mark exl2 as non release (so CI tests them, needs to be removed latet).
* Fixing exl2 (by disabling cuda graphs)
* Fix quantization defaults without cuda graphs on exl2 (linked to new
issues with it).
* Removing serde override.
* Go back to released exl2 and remove log.
* Adding warnings for deprecated bitsandbytes + upgrade info to warn.
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0
* Update poetry lock file
* Fix small PaliGemma logprob differences after the torch update
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:
- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
So, we need weight loads that supports quantized weights. To this
end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
so we need to pad during attention.
- Heads with size 192, needs an extension to our paged attention
fork and we need to ensure that the KV cache is allocated with the
correct size.
- Shared experts.
* Improve the handling of quantized weights
Handling of quantized weights was split between two mechanisms:
- For quantized checkpoints, we used the new weight loader
infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
instead relied on conditional in `get_linear`.
Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overrided by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.
This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:
- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
`get_linear` does not need to know how to handle quantizer linear
layers.
- All quantizer weights are strongly typed, we don't pass around
raw tensors.
- We don't have to pass around the `quantizer` string everywhere.
* Exclude non-MLP layers when using FP8 quantization with Llama
* feat: simple mistral lora integration tests
* fix: include args in docker launcher
* fix: disable cuda graphs with lora and warn
* fix: adjust docs and precommit issues
* fix: re update docs
* Add more representative Llama GPTQ test
The Llama GPTQ test is updated to use a model with the commonly-used
quantizer config format and activation sorting. The old test is
kept around (but renamed) since it tests the format produced by
`text-generation-server quantize`.
* Add support for manually triggering a release build
* Refactor dead code.
* First working step.
* Remove a lot of duplicated code.
* More dead code.
* More cleanup.
* Fix Santacoder test.
* Fixing the simple tests.
* Fixing sharding.
* Fixes for VLM.
* Fixing santacoder (num_kv_heads hardcoded).
* Removing more dead code.
* Fixing `config.n_head`.
* Stopping earlier because of `<end_of_utterance>` in idefics2.
* Addresses comments.
* Removing the dead code.
* Fuse back mistral into FlashCausalLM.
* Finish removal.
* Fixing docs + causal_lm `batch_class`.
* Fixing docs + causal.lm.
* Add default to Gemma Causality.
* Default value for gemma/gemma2.
* Wrong default.
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.
For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand symmetric quantization since the beginning
and incorrectly reporting the model to be symmetric will use
GPTQ-Marlin (which does not support asymmetric quantization).
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes#2029.
While at it, also remove all the image token handling duplication
in `prepare_input`.
* Add pytest release marker
Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host ?
* Moving buildx install after tailscale ?
* 1.79