hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Daniël de Kok	959b9dc25f	Fixup constructor arguments	2024-07-17 07:42:24 +00:00
Daniël de Kok	27ef5aa029	Sync allocator interfaces	2024-07-16 14:42:32 +00:00
Daniël de Kok	48b21eab7a	Last accessed fixes	2024-07-16 11:55:46 +00:00
Daniël de Kok	dd2d6cfe40	Proper support for two allocations with overlapping prefixes	2024-07-16 11:40:35 +00:00
Daniël de Kok	d4ce5389ce	Add another problematic case	2024-07-16 10:20:11 +00:00
Daniël de Kok	0e6ff1293a	Fixes	2024-07-16 10:10:10 +00:00
Daniël de Kok	7c046c9190	First step towards cleaning up. Breaks tests, but I want to shuffle around data structures so that we can just pass block ids to free.	2024-07-15 14:14:10 +00:00
Daniël de Kok	05611f6b40	Renaming, window size	2024-07-15 13:58:10 +00:00
Daniël de Kok	083806aa42	Traitify the current allocator in preparation for swappable alloc	2024-07-15 13:44:22 +00:00
Daniël de Kok	3b4754cd31	Better leaf tracking	2024-07-12 16:03:21 +02:00
Daniël de Kok	1a461234d5	Avoid continuous sorting during reclamation	2024-07-12 13:39:22 +02:00
Daniël de Kok	c352a3e231	Shake out some issues, add correct removal order test	2024-07-12 13:39:22 +02:00
Daniël de Kok	6d0094e5d4	docs/cleanups	2024-07-12 13:39:22 +02:00
Daniël de Kok	3b6bef4078	Walk up to predecessors	2024-07-12 13:39:22 +02:00
Daniël de Kok	9da64a7b16	Basic test passes	2024-07-12 13:39:22 +02:00
Daniël de Kok	dbb82e274c	WIP	2024-07-12 13:39:22 +02:00
drbh	d789de329a	fix: append DONE message to chat stream (#2221 ) * fix: append DONE message to chat stream * fix: update completions endpoint	2024-07-11 10:42:58 -04:00
Nicolas Patry	4c976fb406	Updating the self check (#2209 ) * Updating the self check * Fix. * Revert the CLI . * cli. * Space. * Revert cargo update.	2024-07-09 17:23:48 +02:00
Nicolas Patry	fe710af25f	Adding sanity check to openapi docs.	2024-07-09 11:13:48 +02:00
drbh	87ebb6477b	feat: use model name as adapter id in chat endpoints (#2128 )	2024-07-08 16:06:49 +02:00
Wang, Yi	58effe78b5	update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190 ) update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-08 16:03:59 +02:00
Nicolas Patry	5ad41aa2a6	Fixing missing `object` field for regular completions. (#2175 ) * Fixing missing `object` field for regular completions. * Fixing docs by re-adding missing `Prompt`.	2024-07-03 12:56:27 +02:00
Nicolas Patry	be4a4c47f9	Revert "Fixing missing `object` field for regular completions." This reverts commit `2bbb7fa4b2`.	2024-07-03 10:41:39 +00:00
Nicolas Patry	2bbb7fa4b2	Fixing missing `object` field for regular completions.	2024-07-03 10:40:22 +00:00
drbh	571530dd9a	feat: improve update_docs for openapi schema (#2169 ) * feat: add pre commit step to force schema update when router changes * fix: prefer improved update_doc and start server and compare * fix: adjust typo * fix: adjust revert typo * fix: update workflow to use update_doc md command * feat: improve workflow to check openapi schema too * fix: adjust timeout for CI * fix: adjust raise condition and install server in ci * fix: install protoc before server * feat: improve update doc and add command to print router schema * fix: adjust autodoc workflow * fix: explicitly install protoc and python * fix: alllow trailing space in openapi schema diff	2024-07-03 09:53:35 +02:00
Nicolas Patry	4327210e6b	[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940 ) * Using flash decoding Conditional flashdecoding. Fix max_q. Working kvcache Working version with flash decoding. Make it work for mistral. Fix after rebase.. Less intrusive. REvert changes in modeling. Speedup flashdecoding. HHachweew Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs Not sending preallocated output. * Making it work on non flash decoding. * Fix Cohere. * Fix non decoding paths. * Rebased. * No need for cache_manager anymore. * Update? * "ipex" -> "cpu" * These do not belong. * Factoring cu_seqlen_qk for better abstracting over every model. * Fixing non flash tests/imports. * Changing return everywhere. * Update mistral past. * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though). * Fixup mistral clamping (had issues with cuda graphs). * No need to recreate anything actually.	2024-07-01 23:28:00 +02:00
drbh	9eefb2f672	fix: prefer serde structs over custom functions (#2127 ) * fix: prefer enum for chat object * fix: adjust typo * fix: enum CompletionType not ObjectType * fix: adjust typo * feat: leverage serde for conditional deser * fix: adjust HubTokenizerConfig after rebase * fix: update create_post_processor logic for token type * fix: adjust unwrap syntax in template * Fixing the post processor. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-01 15:08:05 +02:00
Nicolas Patry	b4552f9de9	Fixing clippy. (#2149 )	2024-07-01 12:02:19 +02:00
Wang, Yi	6ea570ddfe	fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148 ) * fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices] Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-01 11:27:53 +02:00
drbh	74b0231b19	fix: refactor post_processor logic and add test (#2137 ) * fix: refactor post_processor logic and add test * fix: remove dev comment * fix: adjust when post_processor is overridden and improve create_post_processor	2024-06-27 23:16:19 +02:00
Nicolas Patry	0e4ab6d31c	Fixing malformed rust tokenizers (#2134 ) * Fixing malformed rust tokenizers * Fix for deepseek too.	2024-06-27 16:04:03 +02:00
Daniël de Kok	dd2d91b043	Idefics2: sync added image tokens with transformers (#2080 ) Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.	2024-06-27 15:54:35 +02:00
Nicolas Patry	bcfcd4740a	Fixing prom leak by upgrading. (#2129 )	2024-06-27 08:08:43 +02:00
drbh	be2d38032a	fix: simplify kserve endpoint and fix imports (#2119 )	2024-06-25 19:30:10 -04:00
drbh	04e1af94d7	Enable multiple LoRa adapters (#2010 ) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: perfer loraxs custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-06-25 14:46:27 -04:00
Nicolas Patry	a2a97b05d6	Fix CI . (#2118 ) Fix clippy.	2024-06-25 17:53:36 +02:00
sunxichen	b69f078041	fix ChatCompletion and ChatCompletionChunk object string not compatible with standard openai api (#2089 ) Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>	2024-06-25 10:59:50 +02:00
KevinDuffy94	1869ee2f57	Add OTLP Service Name Environment Variable (#2076 ) * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069 * Update Docs * Update README.md * Update Launcher Docs * Update Launcher Docs Removing Option	2024-06-25 09:33:01 +02:00
Lucain	3447c722fd	Support `HF_TOKEN` environment variable (#2066 ) * Support HF_TOKEN environement variable * Load test. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-06-25 09:23:12 +02:00
Ziru Niu	0f7d38e774	fix build.rs watch files (#2072 )	2024-06-17 12:10:01 +02:00
drbh	f433f1f770	implement Open Inference Protocol endpoints (#1942 ) * feat: add kserve feature and basic routes * feat: implement infer endpoint wrapper around generate * fix: refactor and improve types * fix: improve infer and simplify * fix: cleanup and improve api docs * fix: refactor and encapsulate kserve feat in file * fix: remove typos after rebase	2024-06-13 12:51:51 -04:00
drbh	42aa8ee1bb	PR #2049 CI run (#2054 ) * Use minijinja's pycompat mode for python methods * fix: cargo fmt lint for pre commit --------- Co-authored-by: Armin Ronacher <armin.ronacher@active-4.com>	2024-06-13 11:53:49 -04:00
drbh	376a0b7ada	Support chat response format (#2046 ) * feat: support response_format in chat * fix: adjust typos * fix: add trufflehog lint	2024-06-11 10:44:56 -04:00
OlivierDehaene	8aece3bd68	feat: move allocation logic to rust (#1835 ) Close #2007	2024-06-05 12:18:38 +02:00
Nicolas Patry	8390e251d9	Making `make install` work better by default. (#2004 ) Making `make install` a much better sane default to start local dev environments.	2024-06-04 19:38:46 +02:00
OlivierDehaene	757223b352	feat: add SchedulerV3 (#1996 ) - Refactor code to allow supporting multiple versions of the generate.proto at the same time - Add v3/generate.proto (ISO to generate.proto for now but allow for future changes without impacting v2 backends) - Add Schedule trait to abstract queuing and batching mechanisms that will be different in the future - Add SchedulerV2/V3 impl	2024-06-04 15:56:56 +02:00
Daniël de Kok	df71aafdcc	router: send the input as chunks to the backend Before this change, the generation input was sent to the backend as a single string, encoding images as Base64 and packing them in Markdown-style links. This change adds a new chunked input representation that separates text chunks from images chunks. Image chunks contain binary data (for smaller message sizes) and the image's MIME type. The stringly-typed inputs are still sent to support backends that do not support chunked inputs yet.	2024-06-03 17:02:41 +02:00
Nicolas Patry	06edde9491	Purely refactors paged/attention into `layers/attention` and make hardware differences more obvious with 1 file per hardware. (#1986 )	2024-05-31 17:57:01 +02:00
Nicolas Patry	612bc483b6	Fixing the text part from tokenizer endpoint. (#1967 )	2024-05-28 16:55:36 +02:00
Nicolas Patry	e76b9824ae	Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959 ) - Axum upgraded to hyper 1.0 and most of the ecosystem switched so it's our time now - [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files) hasn't yet, and hasn't for several months now, so let's disabled the feature for the time being.	2024-05-28 14:52:17 +02:00

253 Commits