* Add support for compressed-tensors w8a8 int checkpoints
This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.
Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama| 3|flexible-extract| 8|exact_match |↑ |0.8431|± |0.0100|
| | |strict-match | 8|exact_match |↑ |0.8393|± |0.0101|
|ifeval | 4|none | 0|inst_level_loose_acc |↑ |0.8597|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.8201|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.7967|± |0.0173|
| | |none | 0|prompt_level_strict_acc|↑ |0.7468|± |0.0187|
These results are in the same ballpark as vLLM.
As usual, lots of thanks to Neural Magic/vLLM for the kernels.
* Always use dynamic input quantization for w8a8 int
It's far less flaky and gives better output.
* Use marlin-kernels 0.3.5
* Fix a typo
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Small fixes
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
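For readers unfamiliar with dynamic input quantization, here is a minimal sketch of the idea: the activation scale is computed from the data at runtime rather than loaded from the checkpoint. It assumes a symmetric, per-row int8 scheme. The function names are hypothetical, and the actual kernels may compute scales differently.

```python
def quantize_dynamic_int8(row):
    """Symmetric int8 quantization of one activation row.

    The scale is derived from the row itself at runtime (dynamic),
    not read from the checkpoint (static). Illustrative only.
    """
    max_abs = max((abs(v) for v in row), default=0.0)
    # Guard against an all-zero row, which would give a zero scale.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the int8 range.
    quantized = [max(-128, min(127, round(v / scale))) for v in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floating-point values."""
    return [q * scale for q in quantized]


activations = [0.02, -1.5, 0.75, 3.0]
q, scale = quantize_dynamic_int8(activations)
restored = dequantize(q, scale)
```

Because the scale tracks the largest magnitude in each row, the rounding error per element stays within half a quantization step regardless of the input range.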
* add ipex moe implementation to support Mixtral and PhiMoe
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update to ipex xpu 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* torch has xpu support in 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix oneapi basekit version
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Remove vLLM dependency for CUDA
This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.
Tested run (since we don't have paged attention in CI):
```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```
* Fix clippy warning
* feat: return streaming errors as an event formatted for openai's client
* fix: propagate completions error events to stream
* fix: improve stream api error format and add status code
* fix: improve streaming error to include error_type
* Revert "fix: improve streaming error to include error_type"
This reverts commit 2b1a360b15.
* Reworked the implementation.
* Revert "Reworked the implementation."
This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.
* Small lifting.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
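As a rough illustration of the streaming error events described above, the sketch below formats an error as a server-sent event with a JSON body. The helper name and the payload field names (`message`, `http_status_code`) are assumptions for illustration, not necessarily the exact schema the router emits.

```python
import json


def format_sse_error(message, http_status_code):
    """Build one server-sent event carrying an error payload.

    Hypothetical helper: field names are illustrative assumptions.
    """
    payload = {
        "error": {
            "message": message,
            "http_status_code": http_status_code,
        }
    }
    # An SSE event is a "data:" line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


event = format_sse_error("Input validation error", 422)
```

Emitting the error inside the stream lets a client that is already consuming events surface the failure instead of seeing the connection end silently.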
* Upgrade outlines to 0.1.1
* Update for new API
* Check if allowed tokens is None
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because:
- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
quantizers.
- Exclusions from quantization are configurable.
This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.
The following types of quantization are supported in this PR:
- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
Support for other quantization types will be added in subsequent PRs.
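To make the per-target configuration and exclusions more concrete, here is a simplified sketch of how a layer could be matched against such a configuration. The configuration dict is a hand-written, reduced example, and `find_scheme` is a hypothetical helper. The real `compressed-tensors` package performs its own parsing and matching, which this does not reproduce.

```python
import re

# Hypothetical, reduced configuration: two groups with different
# quantizers, plus a list of layers excluded from quantization.
config = {
    "config_groups": {
        "group_0": {
            "targets": [r"re:.*mlp\.down_proj"],
            "weights": {"num_bits": 4, "type": "int"},
        },
        "group_1": {
            "targets": ["Linear"],
            "weights": {"num_bits": 8, "type": "int"},
            "input_activations": {"num_bits": 8, "type": "int"},
        },
    },
    "ignore": ["lm_head"],
}


def matches(pattern, layer_name, layer_type):
    """Match a pattern against a layer name or its module type."""
    if pattern.startswith("re:"):
        return re.match(pattern[3:], layer_name) is not None
    return pattern in (layer_name, layer_type)


def find_scheme(config, layer_name, layer_type):
    """Return the first matching group, or None if the layer is excluded."""
    if any(matches(p, layer_name, layer_type) for p in config.get("ignore", [])):
        return None
    for group in config["config_groups"].values():
        if any(matches(t, layer_name, layer_type) for t in group["targets"]):
            return group
    return None


down = find_scheme(config, "model.layers.0.mlp.down_proj", "Linear")
q_proj = find_scheme(config, "model.layers.0.self_attn.q_proj", "Linear")
head = find_scheme(config, "lm_head", "Linear")
```

In this sketch the first matching group wins, so more specific targets are listed before the generic `Linear` target. That ordering rule is an assumption of the example.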
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ
The ipex kernel already provides functions such as add_bias, so there is no need to add the bias outside the kernel.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
* fix: only check model type if config exists
* fix: adjust sharding and lm head logic
* fix qwen2 failure on Intel CPU
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: return correct shape logits and add streaming test
* fix: remove unused import and refactor test
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
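For context on what "multidimensional position ids" means here, the sketch below builds three rows of position ids (temporal, height, width) for a sequence that mixes text and image tokens. It follows the multimodal rotary scheme as commonly described for Qwen2-VL and is an assumption-laden illustration, not the code from this change. The `segments` input format and the function name are hypothetical.

```python
def mrope_position_ids(segments):
    """Build [temporal, height, width] position ids.

    segments: list of ("text", n_tokens) or ("image", (t, h, w)),
    where (t, h, w) is the image token grid. Illustrative only.
    """
    temporal, height, width = [], [], []
    start = 0
    for kind, spec in segments:
        if kind == "text":
            # Text tokens share the same id in all three dimensions.
            ids = list(range(start, start + spec))
            temporal += ids
            height += ids
            width += ids
            start += spec
        else:
            t, h, w = spec
            # Image tokens are indexed by their place in the grid.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        temporal.append(start + ti)
                        height.append(start + hi)
                        width.append(start + wi)
            # The next segment continues after the largest id used.
            start += max(t, h, w)
    return [temporal, height, width]


ids = mrope_position_ids([("text", 2), ("image", (1, 2, 2)), ("text", 1)])
```

Since every request in a batch then carries a position tensor with the same fixed number of rows, the shapes stay static, which is what CUDA graph capture needs.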
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with message and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
Add xpu triton to the Dockerfile; without it, startup fails with "Could not import Flash Attention enabled models: No module named 'triton'"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer)
* Remove redundancy.
* Fixing the tests.
* Flake.lock update ?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Elide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.
* Update test snapshots
* feat(trtllm): rewrite health to not account for current state
* chore(looper): cleanup a bit more
* feat(post_processing): max_new_tokens is const evaluated now
* chore(ffi): formatting
* feat(trtllm): add stop words handling
# Conflicts:
# backends/trtllm/lib/backend.cpp
* chore(trtllm): create specific parallelconfig factory and logging init methods
* chore(trtllm): define a macro for SizeType cast
* chore(trtllm): use GetParallelConfig
* chore(trtllm): minor refactoring
* chore(trtllm): validate there are enough GPUs on the system for the desired model
* chore(trtllm): ensure max throughput scheduling policy is selected
* chore(trtllm): minor fix
* chore(router): minor refactorings
* feat(docker): build with-slurm ompi
* feat(docker): add python3.10 dev to runtime deps
* chore(docker): add mpi to ld_library_path
* chore(docker): install transformers
* feat(trtllm): detect stop_words from generation_config.json
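As an illustration of the last item, the sketch below reads stop token ids from a `generation_config.json` file. It assumes the stop words are derived from the `eos_token_id` field, which in Hugging Face generation configs can be a single integer or a list. The backend itself is not written in Python, and `read_stop_token_ids` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_stop_token_ids(path):
    """Return stop token ids from a generation_config.json file.

    Assumption: stop words come from `eos_token_id`, which may be
    absent, a single integer, or a list of integers.
    """
    path = Path(path)
    if not path.exists():
        return []
    config = json.loads(path.read_text())
    eos = config.get("eos_token_id")
    if eos is None:
        return []
    if isinstance(eos, int):
        return [eos]
    return list(eos)
```

Returning an empty list when the file or field is missing keeps the caller on a single code path, with no stop words configured in that case.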