* nix: build and cache all devshells
* nix: add poetry to the impure shell
This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.
* Fix Nix build, disable pure shell (covered by Nix tests)
* add OpenAI like tool_choice for named choice
* add tests
* fix: run linter and bump api docs
* fix: consolidate changes and remove old tool type
* feat: improve, simplify and rename tool choice struct add required support and refactor
* fix: simplify tool choice logic, improve tests, openapi and rust docs
* fix: refactor away prepare_chat_input and improve tool grammar apply control flow
* feat: update docs and add tool choice configuration section
* fix: simplify naming, tool choice default and improve test
* fix: adjust tool choice none logic, add test and small refactors
* fix: add missing snapshot file
* fix: adjust tool choice type in test
* fix: adjust default when json tool choice is
* fix: remove trailing space lint after rebase
* fix: remove mostly mocked unit test
---------
Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
* Add support for compressed-tensors w8a8 int checkpoints
This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.
Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama| 3|flexible-extract| 8|exact_match |↑ |0.8431|± |0.0100|
| | |strict-match | 8|exact_match |↑ |0.8393|± |0.0101|
|ifeval | 4|none | 0|inst_level_loose_acc |↑ |0.8597|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.8201|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.7967|± |0.0173|
| | |none | 0|prompt_level_strict_acc|↑ |0.7468|± |0.0187|
Which is the same ballpark as vLLM.
As usual, lots of thanks to Neural Magic/vLLM for the kernels.
* Always use dynamic input quantization for w8a8 int
It's far less flaky and gives better output.
* Use marlin-kernels 0.3.5
* Fix a typo
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Small fixes
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
* add ipex moe implementation to support Mixtral and PhiMoe
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update to ipex xpu 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* torch has xpu support in 2.5
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix oneapi basekit version
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Remove vLLM dependency for CUDA
This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.
Tested run (since we don't have paged attention in CI):
```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```
* Fix clippy warning
* feat: return streaming errors as an event formatted for openai's client
* fix: propagate completions error events to stream
* fix: improve stream api error format and add status code
* fix: improve streamin error to include error_type
* Revert "fix: improve streamin error to include error_type"
This reverts commit 2b1a360b15.
* Reworked the implementation.
* Revert "Reworked the implementation."
This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.
* Small lifting.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* Upgrade outlines to 0.1.1
* Update for new API
* Check if allowed tokens is None
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because
- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
quantizers.
- Configurable exclusions for quantization.
This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.
The following types of quantization are supported in this PR:
- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
Support for other quantization types will be added in subsequent PRs.
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ
ipex kernel provide func like add_bias, so no need add it outside
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
* fix: only check model type if config exists
* fix: adjust sharding and lm head logic
* fix qwen2 failure in intel cpu
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: return correct shape logits and add streaming test
* fix: remove unused import and refactor test
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with meesage and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
add xpu triton in dockerfile, or will show "Could not import Flash Attention enabled models: No module named 'triton'"
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer)
* Remove redundancy.
* Fixing the tests.
* Flake.lock update ?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Ellide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.
* Update test snapshots
* feat(trtllm): rewrite health to not account for current state
* chore(looper): cleanup a bit more
* feat(post_processing): max_new_tokens is const evaluated now
* chore(ffi):formatting
* feat(trtllm): add stop words handling
# Conflicts:
# backends/trtllm/lib/backend.cpp
* chore(trtllm): create specific parallelconfig factory and logging init methods
* chore(trtllm): define a macro for SizeType cast
* chore(trtllm): use GetParallelConfig
* chore(trtllm): minor refactoring
* chore(trtllm): validate there are enough GPus on the system for the desired model
* chore(trtllm): ensure max throughput scheduling policy is selected
* chore(trtllm): minor fix
* chore(router): minor refactorings
* feat(docker): build with-slurm ompi
* feat(docker): add python3.10 dev to runtime deps
* chore(docker): add mpi to ld_library_path
* chore(docker): install transformers
* feat(trtllm): detect stop_words from generation_config.json
* (backend) use parking_lot crate for RwLock fairness
# Conflicts:
# backends/trtllm/src/backend.rs
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
* (ffi) use const for GetSamplingConfig
* (server) expose new SchedulingError
* (trt)
* (build) setup ccache if available
* (ffi) add max_new_tokens parameters
* (backend) cleanup a bit
* (backend) expose PullNewTokens
* (ffi) cleanup again
* (ffi) add missing headers imports
* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
* (looper) new looper initial implementation
* (ffi) remove narrowing type warning
* (ffi) encode the provided user prompt within each request thread
* (misc) change scope identifiers
* (backend) implement the post_processor background thread
* (misc) missing Result types for Rust
* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step
* (server) forward auth_token to server::run
* (build) fetchcontent use archives instead of git
* (ffi) fix usage of wrong vector constructor making a capacity fill call
* (ffi) missing namespace for tle::Response
* (ffi) do not use reference capture in lambda as we are not capturing anything
* (backend) refactor & cleanup
* (Dockerfile.trtllm) delete for now
* (misc) simplify [make_]move_iterator by using c++20 type inference
* (misc) no need to move for uint32_t items
* (scheduler) rework submit/pull logic
* (post) impl postprocessing
* (misc) delete backend.rs
* (misc) rerun-if-changed all the cmake modules
* (misc) move to latest trtllm
* (fix): HOPPER_SM_MAJOR is 9 not 8
* (misc: build for sm_{75,80,86,89,90} by default
* (misc): build with trtllm 0.13.0
* (misc): increase verbosity of spdlog
* (fix): do not recreate the stateful hashmap at every it
* (misc): update dependency in trtllm dockerfile
* (misc): update dependency in trtllm dockerfile
* (misc): disable logging in release mode
* (misc): improve trtllm download script robustness
* (fix): ore fixes for Dockerfile
* misc(cuda): require 12.6
* chore(cmake): use correct policy for download_timestamp
* feat(looper): check engine and executorWorker paths exist before creating the backend
* chore(cmake): download timestamp should be before URL
* feat(looper): minor optimizations to avoid growing too much the containers
* chore(trtllm): move dockerfile to right place
* chore(trtllm): disable tokenizer parallelism by default
* chore(trtllm): fmt
* chore(trtllm): post-rebase commit
* chore(trtllm): remove unused method
* feat(trtllm): cache maxNumTokens to avoid calling JSON everytime
* misc(router): remove SchedulingError
* feat(trtllm): do not tokenize twice
* Revert "chore(trtllm): remove unused method"
This reverts commit 31747163
* chore(rebase): fix invalid references
* chore(router): add python dependency
* Lint.
* Fix bad rebase
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>