hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
David Holtz	70066e6d8c	fix: remove continue_final_message chat request param	2024-11-22 14:10:46 -05:00
David Holtz	d6280141de	fix: bump openapi docs	2024-11-22 14:10:46 -05:00
OlivierDehaene	780531ec77	chore: prepare 2.4.1 release (#2773 ) * chore: prepare 2.4.1 release * fix tests * fmt	2024-11-22 17:26:15 +00:00
OlivierDehaene	ab7ccf5bc3	feat: add payload limit (#2726 ) * feat: add payload limit * update launcher	2024-11-21 18:20:15 +00:00
Lucain	d012f229c6	Remove guideline from API (#2762 )	2024-11-21 16:56:38 +00:00
drbh	5489406c4a	PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645 ) * add OpenAI like tool_choice for named choice * add tests * fix: run linter and bump api docs * fix: consolidate changes and remove old tool type * feat: improve, simplify and rename tool choice struct add required support and refactor * fix: simplify tool choice logic, improve tests, openapi and rust docs * fix: refactor away prepare_chat_input and improve tool grammar apply control flow * feat: update docs and add tool choice configuration section * fix: simplify naming, tool choice default and improve test * fix: adjust tool choice none logic, add test and small refactors * fix: add missing snapshot file * fix: adjust tool choice type in test * fix: adjust default when json tool choice is * fix: remove trailing space lint after rebase * fix: remove mostly mocked unit test --------- Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>	2024-11-19 13:31:59 -05:00
jito	003eaec0fb	fix response type of document for Text Generation Inference (#2743 ) Signed-off-by: jitokim <pigberger70@gmail.com>	2024-11-15 13:21:50 +01:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732 ) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
drbh	08c4184eb2	fix: add chat_tokenize endpoint to api docs (#2710 )	2024-11-04 06:44:59 +01:00
drbh	befd9f6735	Support qwen2 vl (#2689 ) * feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with meesage and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2	2024-10-30 12:40:51 -04:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673 ) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
OlivierDehaene	a6b02da971	chore: prepare 2.4.0 release (#2695 )	2024-10-25 21:10:49 +00:00
OlivierDehaene	41c2623735	feat: allow any supported payload on /invocations (#2683 ) * feat: allow any supported payload on /invocations * update openAPI * update doc	2024-10-23 11:26:01 +00:00
OlivierDehaene	03c9388bf7	feat: natively support Granite models (#2682 ) * feat: natively support Granite models * Update doc	2024-10-23 10:04:05 +00:00
Daniël de Kok	5bbe1ce028	Support `e4m3fn` KV cache (#2655 ) * Support `e4m3fn` KV cache * Make check more obvious	2024-10-17 10:42:16 +02:00
Nicolas Patry	cf04a43fb1	Fixing linters. (#2650 )	2024-10-15 12:43:49 +02:00
Omar Sanseviero	51f5401893	Clarify gated description and quicktour (#2631 ) Update quicktour.md	2024-10-14 16:31:37 +02:00
Omar Sanseviero	ce28ee88d5	Small fixes for supported models (#2471 ) * Small improvements for docs * Update _toctree.yml * Updating the doc (we keep the list actually). --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-14 15:26:39 +02:00
vb	d912f0bf55	Update documentation to most recent stable version of TGI. (#2625 ) Update to most recent stable version of TGI.	2024-10-10 16:00:25 +02:00
drbh	8ad20daf33	CI (2599): Update ToolType input schema (#2601 ) * Update ToolType input schema * lint * fix: run formatter * fix: allow tool choide to be null --------- Co-authored-by: Wauplin <lucainp@gmail.com>	2024-10-08 12:35:48 -04:00
Daniël de Kok	2358c2bb54	Add basic FP8 KV cache support (#2603 ) * Add basic FP8 KV cache support This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However support is still limited: * Only the `fp8_e5m2` type is supported. * The KV cache layout is the same as `float16`/`bfloat16` (HND). * The FP8 KV cache is only supported for FlashInfer. * Loading of scales is not yet supported. * Fix Cargo.toml	2024-10-04 17:51:48 +02:00
drbh	3011639ff7	Revert "Unroll notify error into generate response" (#2605 ) Revert "Unroll notify error into generate response (#2597)" This reverts commit `d22b0c1fbe`.	2024-10-03 17:56:40 -04:00
Nicolas Patry	f6e2f05b16	New release 2.3.1 (#2604 ) * New release 2.3.1 * Update doc number	2024-10-03 14:43:49 +02:00
drbh	d22b0c1fbe	Unroll notify error into generate response (#2597 ) * feat: unroll notify_error if no tool is choosen * fix: expect simple message when no tool is selected * fix: improve test to avoid notify_error * fix: improve docs and indicate change in expected response * fix: adjust linting in test file	2024-10-02 11:34:57 -04:00
drbh	2335459556	CI (2592): Allow LoRA adapter revision in server launcher (#2602 ) allow revision for lora adapters from launcher Co-authored-by: Sida <sida@kulamind.com> Co-authored-by: teamclouday <teamclouday@gmail.com>	2024-10-02 10:51:04 -04:00
Nicolas Patry	d18ed5cfc5	Mllama flash version (#2585 ) * Working loading state. * Preprocessing. * Working state ? (Broke idefics1 temporarily). * Cleaner condition. * Fix idefics. * Updating config, removing TODO * Mllama * Ugrade transformers 4.45 * Flashing mllama. * Starting to get there. * Working state. * Integrations tests for mllama (cutting to 10 tokens because there seems' to be instability after (meaning size of the batch matters. * Updating model link. * Earlier assert. * Fix vlm ? * remove log. * Force ignore all images but last. * Default dtype bfloat16. * Update integration test after switch to bf16. * Remove dead code. * Removed dead code. * Upgrade the flake to latest transformers/tokenizers * Move to hf tgi-nix * Upgrade to 0.5.0	2024-10-02 11:22:13 +02:00
drbh	93a7042d7e	feat: support phi3.5 moe (#2479 ) * feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-30 11:15:09 +02:00
Mohit Sharma	f9e561eced	Update ROCM libs and improvements (#2579 ) * style * update torch * ix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env vart * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error messag * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile	2024-09-30 10:54:32 +02:00
Ikram Ul Haq	e790cfc0e4	Update architecture.md (#2577 )	2024-09-30 08:56:20 +02:00
Nicholas Broad	7efcb5e0ed	remove LORA_ADAPTERS_PATH (#2563 ) specify how to call local adapters	2024-09-25 01:20:15 +02:00
Aritra Roy Gosthipaty	e6d29656b5	Adding note for private models in quick-tour document (#2548 ) * chore: adding note for private models in quicktour doc * Update docs/source/quicktour.md Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> --------- Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> Co-authored-by: vb <vaibhavs10@gmail.com>	2024-09-24 15:06:53 +02:00
Nicolas Patry	169178b937	Preparing for release. (#2540 ) * Preparing for release. * Upgrade version in docs.	2024-09-20 17:42:04 +02:00
Daniël de Kok	abd24dd385	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536 )	2024-09-19 22:17:15 +02:00
Nicolas Patry	f512021e77	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-19 20:50:37 +02:00
Martin Iglesias Goyanes	aaea212d0f	Add links to Adyen blogpost (#2500 ) * Add links to Adyen blogpost * Adding to toctree. * Update external.md * Update _toctree.yml --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-06 17:00:54 +02:00
Nicolas Patry	8b96a18265	Adding links to Adyen blogpost. (#2492 )	2024-09-05 16:11:52 +02:00
Wang, Yi	9883f3b40e	update doc with intel cpu part (#2420 ) * update doc with intel cpu part Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release. --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-29 17:42:02 +02:00
drbh	d5202c46f7	feat: add /v1/models endpoint (#2433 ) * feat: add /v1/models endpoint * feat: add /v1/models endpoint * fix: remove unused type import * fix: revert route typo * fix: update docs with new endpoint * fix: add to redocly ignore and lint	2024-08-29 16:32:38 +02:00
drbh	8f99f165ce	fix: improve regex expression (#2468 )	2024-08-28 13:44:44 -04:00
drbh	cfa73b5c99	Pr 2451 ci branch (#2454 ) * fix[router]: Fix tools not passed in chat template Signed-off-by: GitHub <noreply@github.com> * feat: improve default tool serialization and lints * feat: refactor tool logic to include notify_error in prompt and adjust typing * fix: adjust non tool template apply * fix: simplify tool grammar logic and improve schema * feat: avoid skip tool test and avoid empty tool prompts * fix: increase test client timeout for grammar compilation tests --------- Signed-off-by: GitHub <noreply@github.com> Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-08-26 20:19:38 -04:00
Hugo Larcher	53729b74ac	doc: Add metrics documentation and add a 'Reference' section (#2230 ) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-16 19:43:30 +02:00
Nicolas Patry	cb0a29484d	FIxing the CI.	2024-08-16 14:21:29 +02:00
Vaibhav Srivastav	99b662f8c2	Improve the Consuming TGI + Streaming docs. (#2412 ) * Improve the Consuming TGI docs. * Fix erronous update to . * add info about Open AI client. * More updates. * Apply suggestions from code review Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> * Suggestions from Lucain. * Update Gradio snippet. * Up. * Apply suggestions from code review Co-authored-by: Lucain <lucainp@gmail.com> * Update docs/source/basic_tutorials/consuming_tgi.md Co-authored-by: Lucain <lucainp@gmail.com> * Up. * Apply suggestions from code review Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Up. * Up. * Doc review from Nico. * Doc review from Nico. x2 * Last nit --------- Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>	2024-08-16 12:43:08 +02:00
drbh	30395b09f4	fix: improve completions to send a final chunk with usage details (#2336 ) * fix: improve completions to send a final chunk with usage details * fix: include finish reason string * fix: remove dev debug trait and unneeded mut * fix: update openapi schema	2024-08-12 17:26:11 +02:00
drbh	0d06aed02d	feat: add guideline to chat request and template (#2391 ) * feat: add guideline to chat request and template * fix: add template test and update docs	2024-08-09 10:56:45 -04:00
Nicolas Patry	7a48a84784	Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backens (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
Vaibhav Srivastav	b2b9c42724	Update documentation for Supported models (#2386 ) * Minor doc fixes * up. * Other minor updates.	2024-08-09 15:01:34 +02:00
Vaibhav Srivastav	cb3ae30284	Update Quantization docs and minor doc fix. (#2368 ) * Update Quantization docs and minor doc fix. * update readme with latest quants info * Apply suggestions from code review Co-authored-by: Pedro Cuenca <pedro@huggingface.co> * up --------- Co-authored-by: Pedro Cuenca <pedro@huggingface.co>	2024-08-08 16:06:57 -04:00
drbh	21267f3ca3	add gptj modeling in TGI #2366 (CI RUN) (#2372 ) * add gptj modeling Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-07 21:32:37 -04:00
drbh	dd47a3dac4	feat: include local lora adapter loading docs (#2359 )	2024-08-05 12:36:44 -04:00

214 Commits