hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Nicolas Patry	07b01293c5	Prepare patch release. (#2829 )	2024-12-11 21:03:50 +01:00
Nicolas Patry	a2d878fa0f	Small update to docs (#2816 )	2024-12-10 10:46:26 +01:00
Nicolas Patry	b2fac5d947	Hotfix link2 (#2812 ) 2nd hotfix ?	2024-12-09 20:57:18 +01:00
Nicolas Patry	a70dd2998b	Hotfixing the link. (#2811 )	2024-12-09 20:50:07 +01:00
Nicolas Patry	042791fbd5	Prep new version (#2810 ) * New version. * Link fixup. * Update docs. * FIxup.	2024-12-09 20:42:42 +01:00
Nicolas Patry	27fa83ca5b	V3 doc (#2809 ) * V3 document. * Updating asset.	2024-12-09 19:58:07 +01:00
Nicolas Patry	5df8059037	Auto max prefill (#2797 ) * Attempt at automatic max batch prefill. * Taking into account number of shards. * Adding more cards. * Adding A100 + H100 * Adding a few more cards. * Logprobs cost too much. * h100 better name, and keep factor of 2 * Damn inflated sparse tflops. * Typo in h100. * Updated the flops calculation (checked with fvcore). * chunking by default. * Fix prefix caching for chat completion since we removed logprobs. * More tests. * Dropping all the prefill logprobs. * Add a flag that enables users to get logprobs back. * Repairing prompt token counting. * Fixing a few tests. * Remove some scaffolding. * Attempting to reduces the issues (workarounds for now).	2024-12-06 05:52:00 +01:00
Torsten Raudssus	a35d1e6fe5	Removing ../ that broke the link (#2789 )	2024-12-02 05:48:55 +01:00
jp	caff779dd4	Fix: docs typo (#2777 ) Fix: typo in model loading code Fix typo in model loading code	2024-11-26 14:28:58 +01:00
OlivierDehaene	780531ec77	chore: prepare 2.4.1 release (#2773 ) * chore: prepare 2.4.1 release * fix tests * fmt	2024-11-22 17:26:15 +00:00
OlivierDehaene	ab7ccf5bc3	feat: add payload limit (#2726 ) * feat: add payload limit * update launcher	2024-11-21 18:20:15 +00:00
drbh	5489406c4a	PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645 ) * add OpenAI like tool_choice for named choice * add tests * fix: run linter and bump api docs * fix: consolidate changes and remove old tool type * feat: improve, simplify and rename tool choice struct add required support and refactor * fix: simplify tool choice logic, improve tests, openapi and rust docs * fix: refactor away prepare_chat_input and improve tool grammar apply control flow * feat: update docs and add tool choice configuration section * fix: simplify naming, tool choice default and improve test * fix: adjust tool choice none logic, add test and small refactors * fix: add missing snapshot file * fix: adjust tool choice type in test * fix: adjust default when json tool choice is * fix: remove trailing space lint after rebase * fix: remove mostly mocked unit test --------- Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>	2024-11-19 13:31:59 -05:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732 ) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
drbh	befd9f6735	Support qwen2 vl (#2689 ) * feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with meesage and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2	2024-10-30 12:40:51 -04:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673 ) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
OlivierDehaene	a6b02da971	chore: prepare 2.4.0 release (#2695 )	2024-10-25 21:10:49 +00:00
OlivierDehaene	41c2623735	feat: allow any supported payload on /invocations (#2683 ) * feat: allow any supported payload on /invocations * update openAPI * update doc	2024-10-23 11:26:01 +00:00
OlivierDehaene	03c9388bf7	feat: natively support Granite models (#2682 ) * feat: natively support Granite models * Update doc	2024-10-23 10:04:05 +00:00
Daniël de Kok	5bbe1ce028	Support `e4m3fn` KV cache (#2655 ) * Support `e4m3fn` KV cache * Make check more obvious	2024-10-17 10:42:16 +02:00
Omar Sanseviero	51f5401893	Clarify gated description and quicktour (#2631 ) Update quicktour.md	2024-10-14 16:31:37 +02:00
Omar Sanseviero	ce28ee88d5	Small fixes for supported models (#2471 ) * Small improvements for docs * Update _toctree.yml * Updating the doc (we keep the list actually). --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-14 15:26:39 +02:00
vb	d912f0bf55	Update documentation to most recent stable version of TGI. (#2625 ) Update to most recent stable version of TGI.	2024-10-10 16:00:25 +02:00
Daniël de Kok	2358c2bb54	Add basic FP8 KV cache support (#2603 ) * Add basic FP8 KV cache support This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However support is still limited: * Only the `fp8_e5m2` type is supported. * The KV cache layout is the same as `float16`/`bfloat16` (HND). * The FP8 KV cache is only supported for FlashInfer. * Loading of scales is not yet supported. * Fix Cargo.toml	2024-10-04 17:51:48 +02:00
drbh	3011639ff7	Revert "Unroll notify error into generate response" (#2605 ) Revert "Unroll notify error into generate response (#2597)" This reverts commit `d22b0c1fbe`.	2024-10-03 17:56:40 -04:00
Nicolas Patry	f6e2f05b16	New release 2.3.1 (#2604 ) * New release 2.3.1 * Update doc number	2024-10-03 14:43:49 +02:00
drbh	d22b0c1fbe	Unroll notify error into generate response (#2597 ) * feat: unroll notify_error if no tool is choosen * fix: expect simple message when no tool is selected * fix: improve test to avoid notify_error * fix: improve docs and indicate change in expected response * fix: adjust linting in test file	2024-10-02 11:34:57 -04:00
drbh	2335459556	CI (2592): Allow LoRA adapter revision in server launcher (#2602 ) allow revision for lora adapters from launcher Co-authored-by: Sida <sida@kulamind.com> Co-authored-by: teamclouday <teamclouday@gmail.com>	2024-10-02 10:51:04 -04:00
Nicolas Patry	d18ed5cfc5	Mllama flash version (#2585 ) * Working loading state. * Preprocessing. * Working state ? (Broke idefics1 temporarily). * Cleaner condition. * Fix idefics. * Updating config, removing TODO * Mllama * Ugrade transformers 4.45 * Flashing mllama. * Starting to get there. * Working state. * Integrations tests for mllama (cutting to 10 tokens because there seems' to be instability after (meaning size of the batch matters. * Updating model link. * Earlier assert. * Fix vlm ? * remove log. * Force ignore all images but last. * Default dtype bfloat16. * Update integration test after switch to bf16. * Remove dead code. * Removed dead code. * Upgrade the flake to latest transformers/tokenizers * Move to hf tgi-nix * Upgrade to 0.5.0	2024-10-02 11:22:13 +02:00
drbh	93a7042d7e	feat: support phi3.5 moe (#2479 ) * feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-30 11:15:09 +02:00
Mohit Sharma	f9e561eced	Update ROCM libs and improvements (#2579 ) * style * update torch * ix issues * fix clone * revert mkl * added custom PA * style * fix style * style * hide env vart * fix mixtral model * add skinny kernel and merge fixes * fixed style * fix issue for sliding window models * addressed review comments * fix import * improved error messag * updated default value * remove import * fix imports after rebase * float16 dep * improve dockerfile * cleaned dockerfile	2024-09-30 10:54:32 +02:00
Ikram Ul Haq	e790cfc0e4	Update architecture.md (#2577 )	2024-09-30 08:56:20 +02:00
Nicholas Broad	7efcb5e0ed	remove LORA_ADAPTERS_PATH (#2563 ) specify how to call local adapters	2024-09-25 01:20:15 +02:00
Aritra Roy Gosthipaty	e6d29656b5	Adding note for private models in quick-tour document (#2548 ) * chore: adding note for private models in quicktour doc * Update docs/source/quicktour.md Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> --------- Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> Co-authored-by: vb <vaibhavs10@gmail.com>	2024-09-24 15:06:53 +02:00
Nicolas Patry	169178b937	Preparing for release. (#2540 ) * Preparing for release. * Upgrade version in docs.	2024-09-20 17:42:04 +02:00
Daniël de Kok	abd24dd385	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536 )	2024-09-19 22:17:15 +02:00
Martin Iglesias Goyanes	aaea212d0f	Add links to Adyen blogpost (#2500 ) * Add links to Adyen blogpost * Adding to toctree. * Update external.md * Update _toctree.yml --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-06 17:00:54 +02:00
Nicolas Patry	8b96a18265	Adding links to Adyen blogpost. (#2492 )	2024-09-05 16:11:52 +02:00
Wang, Yi	9883f3b40e	update doc with intel cpu part (#2420 ) * update doc with intel cpu part Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release. --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-29 17:42:02 +02:00
drbh	8f99f165ce	fix: improve regex expression (#2468 )	2024-08-28 13:44:44 -04:00
Hugo Larcher	53729b74ac	doc: Add metrics documentation and add a 'Reference' section (#2230 ) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-16 19:43:30 +02:00
Nicolas Patry	cb0a29484d	FIxing the CI.	2024-08-16 14:21:29 +02:00
Vaibhav Srivastav	99b662f8c2	Improve the Consuming TGI + Streaming docs. (#2412 ) * Improve the Consuming TGI docs. * Fix erronous update to . * add info about Open AI client. * More updates. * Apply suggestions from code review Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> * Suggestions from Lucain. * Update Gradio snippet. * Up. * Apply suggestions from code review Co-authored-by: Lucain <lucainp@gmail.com> * Update docs/source/basic_tutorials/consuming_tgi.md Co-authored-by: Lucain <lucainp@gmail.com> * Up. * Apply suggestions from code review Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Up. * Up. * Doc review from Nico. * Doc review from Nico. x2 * Last nit --------- Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>	2024-08-16 12:43:08 +02:00
Nicolas Patry	7a48a84784	Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backens (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
Vaibhav Srivastav	b2b9c42724	Update documentation for Supported models (#2386 ) * Minor doc fixes * up. * Other minor updates.	2024-08-09 15:01:34 +02:00
Vaibhav Srivastav	cb3ae30284	Update Quantization docs and minor doc fix. (#2368 ) * Update Quantization docs and minor doc fix. * update readme with latest quants info * Apply suggestions from code review Co-authored-by: Pedro Cuenca <pedro@huggingface.co> * up --------- Co-authored-by: Pedro Cuenca <pedro@huggingface.co>	2024-08-08 16:06:57 -04:00
drbh	21267f3ca3	add gptj modeling in TGI #2366 (CI RUN) (#2372 ) * add gptj modeling Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-07 21:32:37 -04:00
drbh	dd47a3dac4	feat: include local lora adapter loading docs (#2359 )	2024-08-05 12:36:44 -04:00
Erik Kaunismäki	7451041ecd	refactor usage stats (#2339 ) * refactor usage stats * Update docs/source/usage_statistics.md Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * changes based on feedback * run python3 udpate_doc.py * fix pre-commit * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * delete option around usage stats arg --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-31 16:29:07 +02:00
Erik Kaunismäki	583d37a2f8	Run ci api key (#2315 ) * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints. * change name to info routes * Fix comment * convert strings to lowercase for case insensitive comparison * convert header to string * fixes and update docs * update docs again * revert wrong update --------- Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>	2024-07-29 11:14:17 +02:00
Nicolas Patry	5d121a9705	Preparing for release. (#2285 ) * Preparing for release. * Updating docs. * Fixing token within the docker image for the launcher.	2024-07-23 16:20:17 +02:00

1 2 3 4

161 Commits