Commit Graph

1158 Commits

Author SHA1 Message Date
Nicolas Patry 1d2cb356b9
Fix doc. (#2792) 2024-12-02 05:28:26 +01:00
drbh d471805134
Support continue final message (#2733)
* feat: support continue_final_message param in chat request

* feat: add test for continue final message

* fix: bump openapi docs

* fix: remove continue_final_message chat request param

* fix: remove unneeded launcher args in continue test

* fix: bump test output

* fix: remove accidentally included guideline from rebase

* fix: remove guideline tests

* fix: adjust continuation tests expected text

* fix: replace expected output for continue test
2024-11-27 19:13:30 -05:00
jp caff779dd4
Fix: docs typo (#2777)
Fix typo in model loading code
2024-11-26 14:28:58 +01:00
Wang, Yi 892a26e549
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… (#2778)
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)

Signed-off-by: Wang,Yi A <yi.a.wang@intel.com>
2024-11-26 14:28:11 +01:00
Daniël de Kok 72ab60fdd5
Use FP8 KV cache when specified by compressed-tensors (#2761)
The compressed-tensors configuration can specify the configuration of
the KV cache as well. Use an FP8 KV cache when the configuration tells
us to do so (all other options and types are ignored for now).
2024-11-26 08:27:41 +01:00
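The decision described in #2761 can be pictured with a minimal sketch. It is not the code from the PR: the function name is hypothetical, and the `kv_cache_scheme`, `type`, and `num_bits` keys are assumptions about how the compressed-tensors configuration is laid out.

```python
from typing import Any, Dict, Optional


def kv_cache_dtype_from_config(quant_config: Dict[str, Any]) -> Optional[str]:
    """Return "fp8" when the config asks for an 8-bit float KV cache, else None.

    Hypothetical helper; the key names are assumed, not taken from the PR.
    """
    scheme = quant_config.get("kv_cache_scheme")
    if not isinstance(scheme, dict):
        return None  # no KV cache section: keep the default cache type
    if scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return "fp8"
    # All other options and types are ignored, mirroring the commit message.
    return None


fp8_choice = kv_cache_dtype_from_config(
    {"kv_cache_scheme": {"type": "float", "num_bits": 8}}
)
default_choice = kv_cache_dtype_from_config({"kv_cache_scheme": None})
```

Returning `None` for every unsupported combination keeps the fallback to the default cache type in one place.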
Daniël de Kok 289aa48554
Move JSON grammar -> regex grammar conversion to the router (#2772)
* Move JSON grammar -> regex grammar conversion to the router

This change moves the JSON grammar -> regex grammar conversion to the
router by adding a dependency on the `outlines-core` Rust crate. In
contrast to the Python implementation, the conversions are not LRU-cached
since they seem to be fast enough:

simple schema           time:   [5.8293 µs 5.8307 µs 5.8320 µs]
                        change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                        Performance has improved.

complex schema          time:   [14.875 µs 14.881 µs 14.887 µs]
                        change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                        Performance has improved.

Using the schemas from:
https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
2024-11-25 18:47:34 +01:00
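For context on the caching that #2772 drops, the sketch below shows what LRU-caching a schema-to-regex conversion looks like in Python. The converter is a hypothetical toy that handles only flat objects with integer, boolean, or string properties. It is neither the `outlines-core` conversion nor the previous TGI code.

```python
import json
import re
from functools import lru_cache

# Toy patterns for a tiny JSON Schema subset (assumption, for illustration only).
_TYPE_PATTERNS = {
    "integer": r"-?(0|[1-9][0-9]*)",
    "boolean": r"(true|false)",
    "string": r'"[^"\\]*"',
}


@lru_cache(maxsize=128)
def toy_schema_to_regex(schema_json: str) -> str:
    """Convert a flat object schema to a regex; results are cached per schema string."""
    schema = json.loads(schema_json)
    parts = [
        '"' + re.escape(name) + '":' + _TYPE_PATTERNS[spec["type"]]
        for name, spec in schema["properties"].items()
    ]
    return r"\{" + ",".join(parts) + r"\}"


schema = json.dumps(
    {"properties": {"age": {"type": "integer"}, "ok": {"type": "boolean"}}}
)
pattern = toy_schema_to_regex(schema)
toy_schema_to_regex(schema)  # the second call is served from the cache
```

The cache key is the serialized schema string, since dictionaries are not hashable. When a conversion takes only microseconds, as in the benchmark above, the cache saves little.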
drbh c637d68d74
feat: concat the adapter id to the model id in chat response (#2779)
* feat: concat the adapter id to the model id in chat response

* fix: updated to include only the adapter id in chat response
2024-11-25 12:36:31 -05:00
OlivierDehaene 780531ec77
chore: prepare 2.4.1 release (#2773)
* chore: prepare 2.4.1 release

* fix tests

* fmt
2024-11-22 17:26:15 +00:00
Daniël de Kok e87893d38e
chore: Update to marlin-kernels 0.3.6 (#2771)
This fixes a bug in 2:4 Marlin:
https://github.com/vllm-project/vllm/pull/10464
2024-11-22 14:44:47 +00:00
OlivierDehaene ab7ccf5bc3
feat: add payload limit (#2726)
* feat: add payload limit

* update launcher
2024-11-21 18:20:15 +00:00
Hugo Larcher d5bc6a20bd
feat: Add automatic nightly benchmarks (#2591)
* feat: Add automatic nightly benchmarks

* fix: Update runners group

* fix: add created_at field to results

* fix: Add variable results file location
2024-11-21 17:11:42 +00:00
Lucain d012f229c6
Remove guideline from API (#2762) 2024-11-21 16:56:38 +00:00
Daniël de Kok c5b5b3a11c
docs: Add a README section about using Nix (#2767) 2024-11-21 16:53:27 +00:00
drbh faa10ad0bc
fix: tweak grammar test response (#2769) 2024-11-21 16:46:00 +00:00
OlivierDehaene 8e0c161d0a
fix: incomplete generations with single-token generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754)

entries.len() could be greater than batch.size in prefill, so the entries need to be filtered as well.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* entries was wrongly extended for models that did not support chunking

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
2024-11-21 16:37:55 +00:00
Daniël de Kok 3c54488638
nix: downgrade to outlines 0.1.3 (#2768) 2024-11-21 13:00:26 +01:00
drbh 6ee8d6dd3b
fix: set outlines version to 0.1.3 to avoid caching serialization issue (#2766)
fix: set outlines version to 0.1.3 to avoid bug
2024-11-20 18:09:39 -05:00
Daniël de Kok 07bed530f7
nix: build and cache impure devshells (#2765)
* nix: build and cache all devshells

* nix: add poetry to the impure shell

This shouldn't be used to manage dependencies in a Nix devshell, but can
be handy to update `poetry.lock`.

* Fix Nix build, disable pure shell (covered by Nix tests)
2024-11-20 20:56:11 +01:00
Daniël de Kok 46a5a7e73e
Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758)
This change adds support for wNa16 int checkpoints with 2:4 sparsity
using Marlin 2:4 kernels.
2024-11-20 18:25:23 +01:00
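The 2:4 sparsity pattern mentioned in #2758 means that every group of four consecutive weights holds at most two non-zero values. A minimal check, written as a plain-Python sketch with a hypothetical function name (the real kernels work on packed tensors, not lists):

```python
from typing import Sequence


def is_2_4_sparse(row: Sequence[float]) -> bool:
    """Check that each group of 4 consecutive values has at most 2 non-zeros."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    return all(
        sum(1 for value in row[start : start + 4] if value != 0) <= 2
        for start in range(0, len(row), 4)
    )


sparse_row = [0.5, 0.0, 0.0, -1.25, 0.0, 0.0, 2.0, 0.0]
dense_row = [0.5, 0.1, 0.3, 0.0]
sparse_ok = is_2_4_sparse(sparse_row)
dense_ok = is_2_4_sparse(dense_row)
```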
Daniël de Kok 2fda8845a7
nix: update for outlines 0.1.4 (#2764) 2024-11-20 18:24:29 +01:00
Daniël de Kok 45013b60a4
Install compressed-tensors in Docker CPU builds 2024-11-20 14:17:47 +00:00
drbh bd6e8b3c13
fix: adjust llama MLP name from dense to mlp to correctly apply lora (#2760) 2024-11-19 15:10:22 -05:00
drbh 5489406c4a
PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAI's scheme (#2645)
* add OpenAI-like tool_choice for named choice

* add tests

* fix: run linter and bump api docs

* fix: consolidate changes and remove old tool type

* feat: improve, simplify and rename tool choice struct, add required support and refactor

* fix: simplify tool choice logic, improve tests, openapi and rust docs

* fix: refactor away prepare_chat_input and improve tool grammar apply control flow

* feat: update docs and add tool choice configuration section

* fix: simplify naming, tool choice default and improve test

* fix: adjust tool choice none logic, add test and small refactors

* fix: add missing snapshot file

* fix: adjust tool choice type in test

* fix: adjust default when json tool choice is

* fix: remove trailing space lint after rebase

* fix: remove mostly mocked unit test

---------

Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
2024-11-19 13:31:59 -05:00
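The OpenAI-style `tool_choice` values that #2645 refers to are the strings `"auto"`, `"none"`, and `"required"`, plus an object naming one function. The sketch below normalizes those shapes in Python. The `ToolChoice` class and `parse_tool_choice` function are hypothetical names, and treating a missing value as `"auto"` is an assumption of this sketch, not a statement about the router's default.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union


@dataclass(frozen=True)
class ToolChoice:
    mode: str  # "auto", "none", "required", or "function"
    name: Optional[str] = None  # set only when mode == "function"


def parse_tool_choice(value: Union[None, str, Dict[str, Any]]) -> ToolChoice:
    """Normalize an OpenAI-style tool_choice value into a ToolChoice."""
    if value is None:
        return ToolChoice("auto")  # assumed default for this sketch
    if isinstance(value, str):
        if value in ("auto", "none", "required"):
            return ToolChoice(value)
        raise ValueError(f"unknown tool_choice string: {value!r}")
    if isinstance(value, dict) and value.get("type") == "function":
        function = value.get("function")
        name = function.get("name") if isinstance(function, dict) else None
        if isinstance(name, str) and name:
            return ToolChoice("function", name)
    raise ValueError(f"unsupported tool_choice value: {value!r}")


named = parse_tool_choice({"type": "function", "function": {"name": "get_weather"}})
required = parse_tool_choice("required")
```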
Daniël de Kok 2007a9473a
Update to moe-kernels 0.7.0 (#2720)
This version syncs with the vLLM kernels and brings some performance
improvements.
2024-11-19 14:55:29 +01:00
Daniël de Kok b4ec427ad0
Simplify two ipex conditions (#2755) 2024-11-19 08:04:23 +01:00
drbh 38cff84a3e
feat: support flash attention 2 in qwen2 vl vision blocks (#2721)
* feat: support flash attention 2 in qwen2 vl vision blocks

* fix: calc max_seqlen once and small refactors
2024-11-18 12:46:40 -05:00
Daniël de Kok 3c9df21ff8
Add support for compressed-tensors w8a8 int checkpoints (#2745)
* Add support for compressed-tensors w8a8 int checkpoints

This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.

Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

|     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
|               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
|ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
|               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
|               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|

This is in the same ballpark as vLLM.

As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int

It's far less flaky and gives better output.

* Use marlin-kernels 0.3.5

* Fix a typo

Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Small fixes

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-11-18 17:20:31 +01:00
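"Dynamic input quantization" in #2745 refers to computing the activation scale from the input at run time instead of reading a fixed scale from the checkpoint. A plain-Python sketch of symmetric int8 quantization with such a scale follows. The function names are hypothetical, and using one scale per input vector is an assumption; the commit does not state the granularity the kernels use.

```python
from typing import List, Tuple


def quantize_dynamic_int8(values: List[float]) -> Tuple[List[int], float]:
    """Quantize to int8 with a scale derived from the input's largest magnitude."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to the symmetric int8 range after rounding to the nearest integer.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to floats using the same scale."""
    return [q * scale for q in quantized]


inputs = [0.25, -1.0, 0.8]
q_values, q_scale = quantize_dynamic_int8(inputs)
restored = dequantize(q_values, q_scale)
```

Because the scale follows each input, an unusually large activation does not get clipped by a scale calibrated on different data.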
Wang, Yi a5ecd6e586
add ipex moe implementation to support Mixtral and PhiMoe (#2707)
* add ipex moe implementation to support Mixtral and PhiMoe

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update to ipex xpu 2.5

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* torch has xpu support in 2.5

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix oneapi basekit version

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-11-18 17:16:55 +01:00
drbh fea62e928f
fix: improve find_segments via numpy diff (#2686) 2024-11-18 09:51:06 -05:00
Daniël de Kok 52e48739a5
Remove vLLM dependency for CUDA (#2751)
* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

Test run (since we don't have paged attention in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning
2024-11-17 17:34:50 +01:00
drbh 6489f85269
feat: return streaming errors as an event formatted for openai's client (#2668)
* feat: return streaming errors as an event formatted for openai's client

* fix: propagate completions error events to stream

* fix: improve stream api error format and add status code

* fix: improve streamin error to include error_type

* Revert "fix: improve streamin error to include error_type"

This reverts commit 2b1a360b1511d94ea9a24e5432e498e67939506a.

* Reworked the implementation.

* Revert "Reworked the implementation."

This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.

* Small lifting.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-11-15 14:49:19 +01:00
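As an illustration of what #2668 describes, the sketch below formats an error as a server-sent event whose JSON payload has an `error` object. The log does not show the exact payload TGI emits, so the field names (`message`, `type`, `code`) are an assumption modeled on OpenAI-style error objects, and the function name is hypothetical.

```python
import json


def format_sse_error(message: str, error_type: str, status_code: int) -> str:
    """Build one server-sent event carrying an error payload (assumed layout)."""
    payload = {
        "error": {"message": message, "type": error_type, "code": status_code}
    }
    # An SSE event is a "data:" line terminated by a blank line.
    return "data: " + json.dumps(payload) + "\n\n"


event = format_sse_error("Input validation error", "validation", 422)
```

A client that already parses `data:` lines as JSON can then detect the failure by checking for the `error` key.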
Nicolas Patry 34a3bdedc3
Upgrading our deps. (#2750)
* Upgrading our deps.

* fixup.

* Fixup.
2024-11-15 14:03:27 +01:00
Alex Weston 4580ced091
Upgrade outlines to 0.1.1 (#2742)
* Upgrade outlines to 0.1.1

* Update for new API

* Check if allowed tokens is None

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-11-15 13:22:52 +01:00
jito 003eaec0fb
fix response type of document for Text Generation Inference (#2743)
Signed-off-by: jitokim <pigberger70@gmail.com>
2024-11-15 13:21:50 +01:00
Billel Mokeddem 4f4857a4ac
Fix: Change embeddings to embedding (#2738)
fix: change embeddings to embedding

Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
2024-11-15 13:16:15 +01:00
Billel Mokeddem f9ee46f740
Fix: Change model_type from ssm to mamba (#2740)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
2024-11-15 13:15:36 +01:00
Daniël de Kok 8442f1ac85
benchmark: fix prefill throughput (#2741) 2024-11-15 13:14:55 +01:00
Daniël de Kok ca4f46ddfc
nix: update nixpkgs (#2746)
Updates from Triton 2.1.0 to 3.1.0 (among other things).
2024-11-14 18:48:20 +01:00
Daniël de Kok a785000842
Add initial support for compressed-tensors checkpoints (#2732)
compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
  quantizers.
- Configurable exclusions for quantization.

This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.
2024-11-10 13:54:07 +01:00
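The first and third bullets of #2732 (per-target quantizer configurations and configurable exclusions) can be sketched as a small matching routine. This is an illustration only: TGI delegates the matching to the `compressed-tensors` package, the function names here are hypothetical, and the `re:` prefix for regular-expression patterns is an assumption about the format's conventions.

```python
import re
from typing import Dict, List, Optional


def _matches(pattern: str, layer_name: str, layer_type: str) -> bool:
    """Match a pattern against a layer by regex, exact name, or module type."""
    if pattern.startswith("re:"):
        return re.match(pattern[3:], layer_name) is not None
    return pattern in (layer_name, layer_type)


def select_scheme(
    layer_name: str,
    layer_type: str,
    config_groups: Dict[str, List[str]],
    ignore: List[str],
) -> Optional[str]:
    """Return the first group whose targets match, or None if excluded or unmatched."""
    if any(_matches(p, layer_name, layer_type) for p in ignore):
        return None  # exclusions win over targets
    for group_name, targets in config_groups.items():
        if any(_matches(t, layer_name, layer_type) for t in targets):
            return group_name
    return None


groups = {"w4a16": ["re:.*mlp.*"], "w8a16": ["Linear"]}
excluded = ["lm_head"]
mlp_scheme = select_scheme("model.layers.0.mlp.down_proj", "Linear", groups, excluded)
attn_scheme = select_scheme("model.layers.0.self_attn.q_proj", "Linear", groups, excluded)
head_scheme = select_scheme("lm_head", "Linear", groups, excluded)
```

Checking the exclusion list first means an excluded layer such as `lm_head` stays unquantized even when a broad target like `Linear` would otherwise match it.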
Wang, Yi 97f7a22f0b
add trust_remote_code in tokenizer to fix baichuan issue (#2725)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-11-07 14:43:38 +01:00
Wang, Yi b1f9044d6c
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717)
fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ
The ipex kernel provides functions like add_bias, so there is no need to add the bias outside the kernel.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-11-04 16:07:51 +01:00
Daniël de Kok 5eedb2ec7a
nix: move to tgi-nix `main` (#2718) 2024-11-04 15:40:13 +01:00
Nicolas Patry 9fde566602
Fixing linting on main. (#2719) 2024-11-04 15:21:41 +01:00
Travis Addair aadc9cb485
Fix prefix caching + speculative decoding (#2711) 2024-11-04 15:08:43 +01:00
Nicolas Patry a5593ba83e
Hotfixing auto length (warmup max_s was wrong). (#2716) 2024-11-04 09:55:54 +01:00
drbh 08c4184eb2
fix: add chat_tokenize endpoint to api docs (#2710) 2024-11-04 06:44:59 +01:00
drbh 6e3220529d
fix: create position ids for text only input (#2714)
* fix: create position ids for text only input

* fix: prefer repeat over expand to avoid clone
2024-11-02 08:40:05 +08:00
drbh 01dacf8e8f
fix cuda graphs for qwen2-vl (#2708)
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl

* fix: only check model type if config exists

* fix: adjust sharding and lm head logic

* fix qwen2 failure in intel cpu

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: return correct shape logits and add streaming test

* fix: remove unused import and refactor test

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-11-01 03:05:34 +01:00
drbh befd9f6735
Support qwen2 vl (#2689)
* feat: add support for qwen2 vl model

* feat: fix token padding, enable warmup and process basic request

* fix: improve get_position_ids, add lift embed_tokens

* fix: remove get_cos_sin_hack dev function

* feat: add simple test chat with message and text

* fix: lint test

* fix: adjust positional embeddings for multi dimensional position ids

* fix: update docs and lint unused vars

* fix: include linted file

* fix: add norm after text output

* fix: format model file

* fix: adjust for ruff lints

* fix: remove unused rotate_half

* feat: refactors and calc num features

* fix: prefer position_ids passed from vlm causal lm and reset ids on batch

* fix: adjust get_position_ids if not available and add required args to signatures

* fix: adjust resize case for qwen2_vl warmup

* fix: avoid qwen2 vl specific paths with qwen2
2024-10-30 12:40:51 -04:00
Wang, Yi 46aeb0860d
add xpu triton in dockerfile, or will show "Could not import Flash At… (#2702)
Add xpu triton to the Dockerfile; otherwise the error "Could not import Flash Attention enabled models: No module named 'triton'" is shown.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-10-30 14:18:50 +01:00