hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
drbh	e837d9f92f	fix: remove unneeded launcher args in continue test	2024-11-21 12:02:20 -05:00
David Holtz	5d4f1540c4	fix: remove continue_final_message chat request param	2024-11-21 12:02:20 -05:00
David Holtz	8e13d1565e	fix: bump openapi docs	2024-11-21 12:02:20 -05:00
David Holtz	08e190fb4e	feat: add test for continue final message	2024-11-21 12:02:20 -05:00
drbh	c6e0433442	feat: support continue_final_message param in chat request	2024-11-21 12:02:20 -05:00
OlivierDehaene	8e0c161d0a	fix: incomplete generations w/ single tokens generations and models that did not support chunking (#2770 ) * Incomplete generation stream fix (#2754) entries.len() could > batch.size in prefill, so need to filter as well. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * entries was wrongly extended for model that did not support chunking --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi <yi.a.wang@intel.com>	2024-11-21 16:37:55 +00:00
Daniël de Kok	3c54488638	nix: downgrade to outlines 0.1.3 (#2768 )	2024-11-21 13:00:26 +01:00
drbh	6ee8d6dd3b	fix: set outlines version to 0.1.3 to avoid caching serialization issue (#2766 ) fix: set outlines version to 0.1.3 to avoid bug	2024-11-20 18:09:39 -05:00
Daniël de Kok	07bed530f7	nix: build and cache impure devshells (#2765 ) * nix: build and cache all devshells * nix: add poetry to the impure shell This shouldn't be used to manage dependencies in a Nix devshell, but can be handy to update `poetry.lock`. * Fix Nix build, disable pure shell (covered by Nix tests)	2024-11-20 20:56:11 +01:00
Daniël de Kok	46a5a7e73e	Add support for wNa16 int 2:4 compressed-tensors checkpoints (#2758 ) This change adds support for wNa16 int checkpoints with 2:4 sparsity using Marlin 2:4 kernels.	2024-11-20 18:25:23 +01:00
Daniël de Kok	2fda8845a7	nix: update for outlines 0.1.4 (#2764 )	2024-11-20 18:24:29 +01:00
Daniël de Kok	45013b60a4	Install compressed-tensors in Docker CPU builds	2024-11-20 14:17:47 +00:00
drbh	bd6e8b3c13	fix: adjust llama MLP name from dense to mlp to correctly apply lora (#2760 )	2024-11-19 15:10:22 -05:00
drbh	5489406c4a	PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645 ) * add OpenAI like tool_choice for named choice * add tests * fix: run linter and bump api docs * fix: consolidate changes and remove old tool type * feat: improve, simplify and rename tool choice struct add required support and refactor * fix: simplify tool choice logic, improve tests, openapi and rust docs * fix: refactor away prepare_chat_input and improve tool grammar apply control flow * feat: update docs and add tool choice configuration section * fix: simplify naming, tool choice default and improve test * fix: adjust tool choice none logic, add test and small refactors * fix: add missing snapshot file * fix: adjust tool choice type in test * fix: adjust default when json tool choice is * fix: remove trailing space lint after rebase * fix: remove mostly mocked unit test --------- Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>	2024-11-19 13:31:59 -05:00
Daniël de Kok	2007a9473a	Update to moe-kernels 0.7.0 (#2720 ) This version syncs with the vLLM kernels and brings some performance improvements.	2024-11-19 14:55:29 +01:00
Daniël de Kok	b4ec427ad0	Simplify two ipex conditions (#2755 )	2024-11-19 08:04:23 +01:00
drbh	38cff84a3e	feat: support flash attention 2 in qwen2 vl vision blocks (#2721 ) * feat: support flash attention 2 in qwen2 vl vision blocks * fix: calc max_seqlen once and small refactors	2024-11-18 12:46:40 -05:00
Daniël de Kok	3c9df21ff8	Add support for compressed-tensors w8a8 int checkpoints (#2745 ) * Add support for compressed-tensors w8a8 int checkpoints This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5. Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8: \| Tasks \|Version\| Filter \|n-shot\| Metric \| \|Value \| \|Stderr\| \|---------------\|------:\|----------------\|-----:\|-----------------------\|---\|-----:\|---\|------\| \|gsm8k_cot_llama\| 3\|flexible-extract\| 8\|exact_match \|↑ \|0.8431\|± \|0.0100\| \| \| \|strict-match \| 8\|exact_match \|↑ \|0.8393\|± \|0.0101\| \|ifeval \| 4\|none \| 0\|inst_level_loose_acc \|↑ \|0.8597\|± \| N/A\| \| \| \|none \| 0\|inst_level_strict_acc \|↑ \|0.8201\|± \| N/A\| \| \| \|none \| 0\|prompt_level_loose_acc \|↑ \|0.7967\|± \|0.0173\| \| \| \|none \| 0\|prompt_level_strict_acc\|↑ \|0.7468\|± \|0.0187\| Which is the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels. * Always use dynamic input quantization for w8a8 int It's far less flaky and gives better output. * Use marlin-kernels 0.3.5 * Fix a typo Co-authored-by: drbh <david.richard.holtz@gmail.com> * Small fixes --------- Co-authored-by: drbh <david.richard.holtz@gmail.com>	2024-11-18 17:20:31 +01:00
Wang, Yi	a5ecd6e586	add ipex moe implementation to support Mixtral and PhiMoe (#2707 ) * add ipex moe implementation to support Mixtral and PhiMoe Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update to ipex xpu 2.5 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * torch has xpu support in 2.5 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix oneapi basekit version Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-11-18 17:16:55 +01:00
drbh	fea62e928f	fix: improve find_segments via numpy diff (#2686 )	2024-11-18 09:51:06 -05:00
Daniël de Kok	52e48739a5	Remove vLLM dependency for CUDA (#2751 ) * Remove vLLM dependency for CUDA This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA. Tested run (since we don't have paged attention in CI): ``` ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release [...] 5 snapshots passed. ``` * Fix clippy warning	2024-11-17 17:34:50 +01:00
drbh	6489f85269	feat: return streaming errors as an event formatted for openai's client (#2668 ) * feat: return streaming errors as an event formatted for openai's client * fix: propagate completions error events to stream * fix: improve stream api error format and add status code * fix: improve streamin error to include error_type * Revert "fix: improve streamin error to include error_type" This reverts commit `2b1a360b15`. * Reworked the implementation. * Revert "Reworked the implementation." This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5. * Small lifting. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-11-15 14:49:19 +01:00
Nicolas Patry	34a3bdedc3	Upgrading our deps. (#2750 ) * Upgrading our deps. * fixup. * Fixup.	2024-11-15 14:03:27 +01:00
Alex Weston	4580ced091	Upgrade outlines to 0.1.1 (#2742 ) * Upgrade outlines to 0.1.1 * Update for new API * Check if allowed tokens is None --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-11-15 13:22:52 +01:00
jito	003eaec0fb	fix response type of document for Text Generation Inference (#2743 ) Signed-off-by: jitokim <pigberger70@gmail.com>	2024-11-15 13:21:50 +01:00
Billel Mokeddem	4f4857a4ac	Fix: Change embeddings to embedding (#2738 ) fix: change embeddings to embedding Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>	2024-11-15 13:16:15 +01:00
Billel Mokeddem	f9ee46f740	Fix: Change model_type from ssm to mamba (#2740 ) Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>	2024-11-15 13:15:36 +01:00
Daniël de Kok	8442f1ac85	benchmark: fix prefill throughput (#2741 )	2024-11-15 13:14:55 +01:00
Daniël de Kok	ca4f46ddfc	nix: update nixpkgs (#2746 ) Updates from Triton 2.1.0 to 3.1.0 (among other things).	2024-11-14 18:48:20 +01:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732 ) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
Wang, Yi	97f7a22f0b	add trust_remote_code in tokenizer to fix baichuan issue (#2725 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-07 14:43:38 +01:00
Wang, Yi	b1f9044d6c	fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717 ) fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ ipex kernel provide func like add_bias, so no need add it outside Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-04 16:07:51 +01:00
Daniël de Kok	5eedb2ec7a	nix: move to tgi-nix `main` (#2718 )	2024-11-04 15:40:13 +01:00
Nicolas Patry	9fde566602	Fixing linting on main. (#2719 )	2024-11-04 15:21:41 +01:00
Travis Addair	aadc9cb485	Fix prefix caching + speculative decoding (#2711 )	2024-11-04 15:08:43 +01:00
Nicolas Patry	a5593ba83e	Hotfixing auto length (warmup max_s was wrong). (#2716 )	2024-11-04 09:55:54 +01:00
drbh	08c4184eb2	fix: add chat_tokenize endpoint to api docs (#2710 )	2024-11-04 06:44:59 +01:00
drbh	6e3220529d	fix: create position ids for text only input (#2714 ) * fix: create position ids for text only input * fix: prefer repeat over expand to avoid clone	2024-11-02 08:40:05 +08:00
drbh	01dacf8e8f	fix cuda graphs for qwen2-vl (#2708 ) * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl * fix: only check model type if config exists * fix: adjust sharding and lm head logic * fix qwen2 failure in intel cpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: return correct shape logits and add streaming test * fix: remove unused import and refactor test --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-01 03:05:34 +01:00
drbh	befd9f6735	Support qwen2 vl (#2689 ) * feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with meesage and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2	2024-10-30 12:40:51 -04:00
Wang, Yi	46aeb0860d	add xpu triton in dockerfile, or will show "Could not import Flash At… (#2702 ) add xpu triton in dockerfile, or will show "Could not import Flash Attention enabled models: No module named 'triton'" Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-30 14:18:50 +01:00
Nicolas Patry	98330df65e	Monkey patching as a desperate measure. (#2704 ) * Monkey patching as a desperate measure. * New snapshot ?	2024-10-28 11:25:13 +01:00
Nicolas Patry	513d19b955	More timeout on docker start ? (#2701 ) * More timeout on docker start ? * Latest upgrade.	2024-10-28 08:57:22 +01:00
Nicolas Patry	3a9cdc3241	Fixing auto bloom test. (#2699 )	2024-10-28 06:14:11 +01:00
Nicolas Patry	78ce618c70	Update poetry lock. (#2698 )	2024-10-28 06:11:33 +01:00
Nicolas Patry	90b226db29	We can have a tokenizer anywhere. (#2527 ) * We can have a tokenizer anywhere. * Handling potential lack of offsets (python tokenizer) * Remove redundancy. * Fixing the tests. * Flake.lock update ? * Fixing the GIL locking. * Fixing mamba by using the transformers version. * Adding the legacy handle. * Ellide lifetime. * Lint. * Deprecation message. * Fixing bad rebase.	2024-10-28 05:00:24 +01:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673 ) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
Nicolas Patry	2e4f4ba1bb	Green main (#2697 )	2024-10-28 04:59:32 +01:00
Nicolas Patry	8a8794a672	Avoiding timeout for bloom tests. (#2693 ) * Avoiding timeout for bloom tests. * Skip the test let's see if it's always the first tests that fails. * Fail early. * Pulling ? * No early exit.	2024-10-26 05:35:28 +02:00
OlivierDehaene	a6b02da971	chore: prepare 2.4.0 release (#2695 )	2024-10-25 21:10:49 +00:00

1 2 3 4 5 ...

1149 Commits All Branches Search

1149 Commits

All Branches