hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
drbh	de6cb15fa5	fix: improve tool type, bump pydantic and outlines (#1650 ) This PR resolves a couple of issues: - [X] adjusts the tool response to align with OpenAI's tools response type - [X] bumps pydantic to `2.6.4` in all apps (resolves dependency issue when running tests) - [X] bumps `outlines` version and fixes import for new name	2024-03-21 12:45:56 -04:00
drbh	dfbd9a39a2	feat: bump minijina and add test for core templates (#1626 ) This PR bumps `minijinja` and adds tests for all core models as identified by @xenova 🙏 Inspiration: https://github.com/huggingface/huggingface.js/blob/main/packages/jinja/test/e2e.test.js TODO: - [X] add new test to iterate over known templates - [X] add default templates - [x] add custom templates	2024-03-20 09:13:46 -04:00
Lucain	23fba672e8	Fix index in ChatCompletionChunk (#1648 ) Fix a small inconsistency compared to OpenAI's chat-completion behavior (introduced in https://github.com/huggingface/text-generation-inference/pull/1427 cc @drbh). When using `stream=True`, each chunk has an `index` value in `ChatCompletionChoice`. This index is not meant to be the index of the generated token but the index of the choice, which is always 0 (since TGI always returns a single choice). See https://platform.openai.com/docs/api-reference/chat/object: > index _integer_ > The index of the choice in the list of choices. --- So instead of ```js data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":1,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]} data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":2,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]} data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":3,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]} ``` it should return ```js data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"I"},"logprobs":null,"finish_reason":null}]} data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"'"},"logprobs":null,"finish_reason":null}]} 
data:{"id":"","object":"text_completion","created":1710508199,"model":"HuggingFaceH4/zephyr-7b-beta","system_fingerprint":"1.4.3-sha-e6bb3ff","choices":[{"index":0,"delta":{"role":"assistant","content":"m"},"logprobs":null,"finish_reason":"length"}]} ``` EDIT: I also edited ToolCall.index to always be `0` (instead of the generated token index), but for this one I'm actually unsure. It might be the index of the tool in the array of tools? OpenAI's documentation doesn't provide any information about it: > index _integer_ --- I also noticed that in OpenAI's example, the last chunk doesn't have a delta and is the only one that returns a `finish_reason`. TGI is slightly different since the last chunk has both the last delta (i.e. the last generated token) and the finish reason. I don't think this is worth fixing since it is not a requirement according to the docs/specs (at least not that I know of).	2024-03-16 12:14:29 -04:00
drbh	7e08751378	fix: add missing stop parameter for chat request (#1619 ) This PR adds the missing `stop` parameter to the `ChatRequest` struct which allows calls to specify a list of stop sequences	2024-03-01 12:08:11 -05:00
drbh	3dd7da2198	feat: accept legacy request format and response (#1527 ) This WIP PR adds support for the legacy OpenAI `v1/completions` API. This should allow TGI to be a drop-in replacement for OpenAI when using tools that rely on the completions API. Should fix: https://github.com/huggingface/text-generation-inference/issues/1468	2024-02-29 10:44:20 -05:00
Nicolas Patry	910d0a9062	Fixing x-compute-time. (#1606 ) # What does this PR do? It was meant to be in seconds float	2024-02-28 11:30:37 +01:00
drbh	9b6db5f793	Support tools (#1587 ) This work-in-progress PR begins to add support for tools. Tools rely on grammar support and still have some unsolved challenges. Opening the PR for visibility and feedback.	2024-02-28 11:10:27 +01:00
drbh	ac5a1c6f51	fix: avoid default message (#1579 ) This PR avoids setting a default message in order to avoid unexpected generations	2024-02-22 08:56:42 -05:00
OlivierDehaene	010508cec8	fix: fix openapi schema (#1586 )	2024-02-21 15:30:45 +01:00
OlivierDehaene	9c1cb81cd8	v1.4.2 (#1585 )	2024-02-21 14:50:57 +01:00
OlivierDehaene	fa8a8e05af	fix(router): fix openapi and add jsonschema validation (#1578 )	2024-02-21 11:05:32 +01:00
drbh	c9f4c1af31	fix: refactor syntax to correctly include structs (#1580 ) This PR fixes a compilation bug related to conditionally adding docs behind a feature flag	2024-02-20 10:38:35 -05:00
drbh	df23062574	improve endpoint support (#1577 ) small PR to add a new interface endpoint behind a feature	2024-02-20 14:04:51 +01:00
OlivierDehaene	0f2daad8b9	feat: add chat template struct to avoid tuple ordering errors (#1570 )	2024-02-16 16:37:32 +01:00
OlivierDehaene	9946165ee0	chore: add pre-commit (#1569 )	2024-02-16 11:58:58 +01:00
Aaron Mihalik	142cdabed3	Bugfix: eos and bos tokens positions are inconsistent (#1567 )	2024-02-16 11:44:04 +01:00
Aaron Mihalik	c55abac384	Added `name` field to OpenAI compatible API Messages (#1563 ) # What does this PR do? Literally just adds the name field to the Message class. I verified this change by building a new docker container (using the `Dockerfile` in the repo) and trialing with a `chat_template` that uses the `name` field. Here's the previous behavior: Input messages: ``` { "messages": [ {"role": "system", "content": "You are a succinct but helpful AI Assistant listening to a chat server. Address everyone by @<username>"}, {"role": "user", "name": "Aaron", "content": "Hello There!"}, {"role": "assistant", "content": " Hello @Aaron! How can I assist you today?"}, {"role": "user", "name": "Sally", "content": "Hiya everyone. Is @Aaron is this room?"} ], "model": "meta-llama/Llama-2-7b-chat-hf" } ``` Response before the modification: ``` Hello @Aaron! Yes, you are in the chat room. How can I assist you today? 😊 Hiya everyone! waves It's great to see you all here. Is there something on your mind that you'd like to talk about or ask? I'm here to listen and help in any way I can. 🤖 ``` Response after my modification: ``` Hello @Sally! Yes, @Aaron is currently in the chat room. How may I assist you today? ``` Fixes #1558 ## Who can review? @Narsil --------- Co-authored-by: Aaron Mihalik <aaron.mihalik@parsons.us> Co-authored-by: drbh <david.richard.holtz@gmail.com>	2024-02-15 13:30:31 -05:00
drbh	cef0553d59	Outlines guided generation (#1539 ) This WIP PR starts to add grammar support via outlines. Currently this PR supports very simple regex grammars and does not optimize for precompiling or caching grammar FSMs. todo: - [X] add simple outlines guidance to `NextTokenChooser` - [X] update protos for grammar - [X] update generation params API - [X] constrain simple grammar - [ ] support parsing more complex grammar into fsm - [ ] support all grammar types supported by outlines - [ ] explore optimizations to avoid recompiling grammars guided request ```bash curl -s 'http://localhost:3000/generate' \ --header 'Content-Type: application/json' \ --data-raw '{ "inputs": "make an email for david: \n", "parameters": { "max_new_tokens": 6, "grammar": "[\\w-]+@([\\w-]+\\.)+[\\w-]+" } }' | jq ``` response ```json { "generated_text": "david@example.com" } ``` unguided request ```bash curl -s 'http://localhost:3000/generate' \ --header 'Content-Type: application/json' \ --data '{ "inputs": "make an email for david: \n", "parameters": { "max_new_tokens": 6 } }' | jq ``` response ```json { "generated_text": " email = 'david" } ```	2024-02-15 10:28:10 +01:00
drbh	246ad39d04	feat: add deserialize_with that handles strings or objects with content (#1550 ) This PR adds a simple custom `deserialize_with` function that parses a string or an object with a content property. This should help support more token configuration files stored on the hub	2024-02-13 10:01:02 -05:00
OlivierDehaene	532146338b	feat(router): add max_batch_size (#1542 ) Some hardware requires a maximum batch size.	2024-02-09 12:38:41 +01:00
OlivierDehaene	09b7c26bbd	feat(server): add frequency penalty (#1541 )	2024-02-08 18:41:25 +01:00
drbh	1734540211	feat: use existing add_generation_prompt variable from config in temp… (#1533 ) This PR adds support to read the `add_generation_prompt` from the config and use it in the chat template. If `add_generation_prompt` does not exist we default to false	2024-02-07 09:35:53 +01:00
Nicolas Patry	0e97af456a	Updating tokenizers. (#1517 )	2024-02-01 16:26:48 +01:00
drbh	ee1cf51ce7	fix: tokenizer config should use local model path when possible (#1518 ) This PR fixes the issue with loading a local tokenizer config. Previously the default functionality would look in the current working directory. Now if a local model path is specified we will check that directory for the tokenizer_config. ## Examples of valid commands uses tokenizer_config from hub ``` text-generation-launcher --model-id HuggingFaceH4/zephyr-7b-beta ``` use tokenizer_config from local model path ``` text-generation-launcher \ --model-id ~/.cache/huggingface/hub/models--HuggingFaceH4--zephyr-7b-beta/snapshots/dc24cabd13eacd3ae3a5fe574bd645483a335a4a/ ``` use specific tokenizer_config file ``` text-generation-launcher \ --model-id ~/.cache/huggingface/hub/models--HuggingFaceH4--zephyr-7b-beta/snapshots/dc24cabd13eacd3ae3a5fe574bd645483a335a4a/ \ --tokenizer-config-path ~/.cache/huggingface/hub/models--HuggingFaceH4--zephyr-7b-beta/snapshots/dc24cabd13eacd3ae3a5fe574bd645483a335a4a/tokenizer_config.json ``` --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-02-01 09:39:32 -05:00
Nicolas Patry	9ad7b6a1a1	Hotfix the / health - route. (#1515 )	2024-02-01 13:29:04 +01:00
Nicolas Patry	a9ea60684b	Create the compute type at launch time (if not provided in the env). (#1505 )	2024-01-29 12:30:50 +01:00
Nicolas Patry	0424dabb01	Sending compute type from the environment instead of hardcoded string (#1504 ) # What does this PR do? Sending compute type from the environment instead of hardcoded string Using env is slow, therefore getting it from global state instead.	2024-01-29 11:20:08 +01:00
Nicolas Patry	ebecc06161	Update the docs to include newer models. (#1492 )	2024-01-26 16:07:31 +01:00
Nicolas Patry	4c7315dde5	Trying to fix that flaky test. (#1491 )	2024-01-26 14:06:27 +01:00
drbh	13dd8e2361	fix: show warning with tokenizer config parsing error (#1488 ) This tiny PR just prints the parsing error when a tokenizer config fails to load. This is helpful when a chat_template won't load due to formatting issues https://github.com/huggingface/text-generation-inference/pull/1427#issuecomment-1909226388	2024-01-26 10:41:39 +01:00
Nicolas Patry	86c8335f1b	Add a new `/tokenize` route to get the tokenized input (#1471 ) # What does this PR do? Ideally this is done client side, but this is a recurring request, therefore we implemented it. - Runs only if rust tokenizer is present (not encumbering the main inference pipeline is important). - Returns simple results, ID, text (gotten with offsets from the original string) and offsets (so users can do things like highlighting text).	2024-01-25 14:19:03 +01:00
drbh	7872b8c55b	Add messages api compatibility docs (#1478 ) This PR adds a new page to the docs that describes the Messages API and how to use it. Additionally this page will contain cloud provider specific information for enabling and using this feature. This PR includes a SageMaker example/information.	2024-01-24 11:41:28 -05:00
Jacob Keisling	82f87ada6f	Disable `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API (#1470 ) This PR makes some minor tweaks to the new OpenAI-compatible chat endpoint #1427 in `GenerateParameters`: - Disables `decoder_input_details` when streaming is enabled. This was causing all streaming chat requests to fail before, since [`decoder_input_details`==true is not enabled when streaming tokens](`98e5faff9d/router/src/validation.rs (L406)`). - Passes through `temperature` and `top_p` hyperparameters from the API request to `GenerateParameters` ## Testing ```bash curl localhost:8080/v1/chat/completions \ -X POST \ -d '{ "model": "", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \ -H 'Content-Type: application/json' ``` Should work correctly. Currently, most recent release from `main` returns error: ``` data:{"error":"Input validation error: `decoder_input_details` == true is not supported when streaming tokens","error_type":"validation"} ``` It's my first time contributing to this project, so I could be missing something. Would especially appreciate @drbh's eyes on this one	2024-01-23 09:55:05 -05:00
drbh	98e5faff9d	feat: conditionally toggle chat on invocations route (#1454 ) This PR adds support for reading the `OAI_ENABLED` env var, which changes the function called when the `/invocations` route is called. If `OAI_ENABLED=true` the `chat_completions` method is used, otherwise it defaults to `compat_generate`. example running the router ```bash OAI_ENABLED=true \ cargo run -- \ --tokenizer-name mistralai/Mistral-7B-Instruct-v0.2 ``` example request ```bash curl localhost:3000/invocations \ -X POST \ -d '{ "model": "tgi", "messages": [ { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": false, "max_tokens": 20, "logprobs": true, "seed": 0 }' \ -H 'Content-Type: application/json' | jq ``` Please let me know if any naming changes are needed or if any other routes need similar functionality.	2024-01-22 10:29:01 -05:00
drbh	becd09978c	chore: bump rust version and annotate/fix all clippy warnings (#1455 ) This PR just bumps the latest rust version and makes clippy happy ```bash cargo clippy --all -- -D warnings # Finished dev [unoptimized + debuginfo] target(s) in 0.10s ```	2024-01-22 15:22:54 +01:00
drbh	3ccb3bb0b5	feat: support raise_exception, bos and eos tokens (#1450 ) This PR adds support to handle the custom jinja function `raise_exception` and passes the `bos` and `eos` tokens into the template. Additionally, this PR adds 3 tests to validate and show examples of what can and cannot be parsed currently. ```bash cargo test --package text-generation-router --lib -- infer::tests --nocapture # Finished test [unoptimized + debuginfo] target(s) in 7.82s # Running unittests src/lib.rs (target/debug/deps/text_generation_router-18a0bbf99c2ca1b4) # running 3 tests # test infer::tests::test_chat_template_valid_with_raise ... ok # test infer::tests::test_chat_template ... ok # test infer::tests::test_chat_template_invalid_with_raise ... ok # test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 15 filtered out; finished in 0.00s ```	2024-01-18 12:31:56 +01:00
drbh	0eabc83541	feat: supports openai chat completions API (#1427 ) This PR adds support to make TGI a drop in replacement for OpenAI clients by exposing the same HTTP interface. Notes - TGI inits a single model at startup so the `model` field is unused in HTTP requests. - `max_tokens` and `stream` should work as expected but other params may be (unimplemented or not supported) General approach - fetch the `tokenizer_config` at startup from the hub - pass `tokenizer_config` into `Infer` so we have it at request time - use the `chat_template` on the config to format chat request - parse jinja template and render chat string - pass inputs into existing generate function - wrap generation output in expected structure before returning # How to test ### Streaming curl ```bash curl localhost:3000/v1/chat/completions \ -X POST \ -d '{ "model": "tgi", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \ -H 'Content-Type: application/json' ``` It is also possible to use the `openai` python library and change the base url ### 🌊 STREAMING REQUEST ```python from openai import OpenAI # init the client but point it to TGI client = OpenAI( base_url="http://localhost:3000/v1", api_key="not needed for a local LLM" ) chat_completion = client.chat.completions.create( model="tgi", messages=[ {"role": "system", "content": "You are a helpful assistant." 
}, {"role": "user", "content": "What is deep learning?"} ], stream=True ) # iterate and print stream for message in chat_completion: print(message) # ChatCompletionChunk(id='', choices=[Choice(delta=ChoiceDelta(content=' that', function_call=None, role='assistant', tool_calls=None), finish_reason=None, index=2, logprobs=None)], created=1704486761, model='', object='text_completion', system_fingerprint='') ``` ### 🚗 SYNCHRONOUS REQUEST ```python from openai import OpenAI # init the client but point it to TGI client = OpenAI( base_url="http://localhost:3000/v1", api_key="not needed for a local LLM" ) chat_completion = client.chat.completions.create( model="tgi", messages=[ {"role": "system", "content": "You are a helpful assistant." }, {"role": "user", "content": "What is deep learning?"} ], stream=False ) print(chat_completion) # ChatCompletion(id='', choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='\nDeep learning is a new field of research that has been gaining traction in the last ...', role='assistant', function_call=None, tool_calls=None))], created=1704486762, model='', object='text_completion', system_fingerprint='', usage=CompletionUsage(completion_tokens=100, prompt_tokens=76, total_tokens=176)) ``` ## How to run dev ```bash cd text-generation-inference/server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2 ``` ***note many of the existing `chat_templates` use non standard `jinja` (ie. adding a `raise` to the template) which will throw an error when parsing; hence using `upstage/SOLAR-10.7B-Instruct-v1.0` since it has a valid template ```bash cd text-generation-inference/router cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0 ``` trigger ```bash curl localhost:3000/v1/chat/completions \ -X POST \ -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." 
}, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \ -H 'Content-Type: application/json' ``` ^ supports `stream: true` and `stream: false` requests	2024-01-16 11:07:41 +01:00
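The last step of the general approach in #1427 ("wrap generation output in expected structure before returning") can be sketched in Python. This is only an illustration: the field names mirror the `ChatCompletion` object printed in the commit message above, while the function name `wrap_chat_completion` and its parameters are hypothetical and are not taken from the router, which is written in Rust.

```python
import time
from typing import Any, Dict, Optional


def wrap_chat_completion(
    generated_text: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: Optional[str] = None,
    model: str = "",
) -> Dict[str, Any]:
    """Wrap raw generated text in an OpenAI-style chat completion dict.

    Hypothetical helper: the layout follows the example output in the
    commit message, not the actual router implementation.
    """
    return {
        "id": "",
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "system_fingerprint": "",
        "choices": [
            {
                # A single choice is returned, so its index is 0.
                "index": 0,
                "message": {"role": "assistant", "content": generated_text},
                "logprobs": None,
                "finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

Called with `prompt_tokens=76` and `completion_tokens=100`, the sketch reports `total_tokens` as 176, matching the usage block in the synchronous example.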
Nicolas Patry	ac08b4ef9c	Return prompt vs generated tokens. (#1436 ) # What does this PR do? Fixes #637	2024-01-11 13:01:43 -05:00
OlivierDehaene	fbeb1c4475	fix: follow base model for tokenizer in router (#1424 ) Close #1422	2024-01-10 16:35:54 +01:00
OlivierDehaene	d077150eb7	fix: fix gpt-q with groupsize = -1 (#1358 )	2023-12-18 16:07:05 +01:00
OlivierDehaene	8428ed1011	fix: fix offline (#1341 ) (#1347 ) @oOraph --------- Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>	2023-12-18 10:20:08 +01:00
OlivierDehaene	50b495f3d8	feat: add more latency metrics in forward (#1346 )	2023-12-14 15:59:38 +01:00
OlivierDehaene	28821bfd5d	fix: default max_new_tokens to 100	2023-12-13 09:19:19 +01:00
OlivierDehaene	3a521c92b3	feat: mixtral (#1328 )	2023-12-11 14:43:40 +01:00
Nicolas Patry	9ecfa16b12	Speculative (#1308 )	2023-12-11 12:46:30 +01:00
Nicolas Patry	ed2a3f617e	Exllama v2 (#1211 ) # What does this PR do? See #1165 --------- Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-153.ec2.internal>	2023-11-25 22:38:38 +01:00
Nicolas Patry	3c02262f29	Reduce race condition on file system for test	2023-11-23 15:42:48 +00:00
OlivierDehaene	3dbc649b11	fix: do not leak inputs on error (#1228 ) Close #1225	2023-11-20 10:33:44 +01:00
OlivierDehaene	f9910d13e2	feat: remove flume (#1184 )	2023-10-23 15:51:12 +02:00
OlivierDehaene	5e28f44a83	#1049 CI (#1178 ) See #1049 --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi <yi.a.wang@intel.com>	2023-10-20 10:28:45 +02:00
OlivierDehaene	20ee71dcf5	fix: force one of max_new_tokens or truncate with slow tokenizer	2023-10-11 10:46:40 +02:00
Nicolas Patry	6df43da0a4	Modify the default for `max_new_tokens`. (#1097 ) # What does this PR do? Clients that do not specify a max_length now imply `max_new_tokens = max_total_tokens - input_length`. This is a significant change, but it seems more in line with what users expect from a standing server. --------- Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2023-10-04 17:38:42 +02:00
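The default described in #1097 is a single subtraction. A minimal Python sketch, assuming a hypothetical helper name and signature (the router itself is written in Rust and its validation code is not shown in this log):

```python
from typing import Optional


def resolve_max_new_tokens(
    requested: Optional[int], input_length: int, max_total_tokens: int
) -> int:
    """Return the generation budget for a request.

    Hypothetical helper: when the client sends no limit, fall back to
    whatever room is left under max_total_tokens, as #1097 describes.
    """
    if requested is not None:
        return requested
    return max_total_tokens - input_length
```

For example, with `max_total_tokens=2048` and a 76-token prompt, a request without an explicit limit would get a budget of 1972 new tokens under this sketch.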
OlivierDehaene	3b56d7669b	feat: add mistral model (#1071 )	2023-09-28 09:55:47 +02:00
Nicolas Patry	a049864270	Preping 1.1.0 (#1066 ) # What does this PR do? Upgrade all relevant versions and dependencies.	2023-09-27 10:40:18 +02:00
Nicolas Patry	211b54ac41	Rebased #617 (#868 ) --------- Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>	2023-08-28 11:43:47 +02:00
Nicolas Patry	05dd14fdb9	Fix `tokenizers==0.13.4` . (#838 )	2023-08-14 19:26:19 +02:00
ivarflakstad	8bdb16ee9a	Use destructuring in router arguments to avoid '.0' (#798 ) # What does this PR do? This is purely code style - not anything important. Instead of writing `req.0` all over we can use [destructuring](https://doc.rust-lang.org/rust-by-example/flow_control/match/destructuring/destructure_structures.html) to access the contained value that we actually want. (Destructuring in function parameters [here](https://doc.rust-lang.org/reference/items/functions.html#function-parameters)) ## Who can review? @OlivierDehaene	2023-08-10 10:52:50 +02:00
OlivierDehaene	afd04dc71e	feat(server): update vllm version (#723 )	2023-07-28 15:36:38 +02:00
OlivierDehaene	73a4d65d26	feat: add cuda memory fraction (#659 ) Close #673	2023-07-24 11:43:58 +02:00
OlivierDehaene	1da642bd0e	feat(server): add local prom and health routes if running w/ ngrok	2023-07-21 16:56:30 +02:00
OlivierDehaene	b66b190403	feat(router): ngrok edge (#642 )	2023-07-19 11:59:58 +02:00
OlivierDehaene	fe80f5360c	feat(server): auto max_batch_total_tokens for flash att models (#630 )	2023-07-19 09:31:25 +02:00
OlivierDehaene	982ce3227b	feat(router): explicit warning if revision is not set (#608 )	2023-07-13 18:59:38 +02:00
OlivierDehaene	b7327205a6	feat(launcher): add arg validation and drop subprocess (#595 )	2023-07-13 14:22:37 +02:00
OlivierDehaene	b4024edd45	feat: better errors for warmup and TP (#575 ) Close #571	2023-07-10 14:47:15 +02:00
OlivierDehaene	6f42942772	feat(router): add argument for hostname in router (#545 ) (#550 ) # What does this PR do? In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with ```commandline cargo run -- --port 8080 --hostname :: curl -I -X GET 'http://[::1]:8080/health' # failed before this commit ``` Trigger CI --------- Co-authored-by: Phil Chen <philchen2000@gmail.com>	2023-07-05 18:28:45 +02:00
OlivierDehaene	e28a809004	v0.9.0 (#525 )	2023-07-01 19:25:41 +02:00
OlivierDehaene	3b0c979efc	feat(router): arg validation (#519 )	2023-06-30 20:07:49 +02:00
OlivierDehaene	e74bd41e0f	feat(server): add paged attention to flash models (#516 ) Closes #478	2023-06-30 19:09:59 +02:00
Robert Kimball	70f485bf9f	feat(router): add header option to disable buffering for the generate_stream response (#498 ) # This PR adds an http header option to disable buffering for the generate_stream endpoint response stream. Problem: If a model is run behind a proxy server such as nginx that has buffering enabled then the response stream from generate_stream gets aggregated into a single response which basically disables streaming. Instead of getting a chunked response where each token is presented over time the response presents everything all at once. Solution: This change adds the `X-Accel-Buffering` http header which disables buffering for the generate_stream response, allowing the response to stream properly.	2023-06-28 11:50:12 +02:00
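A sketch of what the #498 fix amounts to on the response side, written in Python for illustration only. The commit message names the `X-Accel-Buffering` header but not its value; `no` is the value nginx documents for turning off proxy buffering, so treat it, the extra headers, and both function names as assumptions rather than the router's actual code.

```python
import json
from typing import Any, Dict


def sse_response_headers() -> Dict[str, str]:
    """Headers for a streaming response that a buffering proxy should pass through.

    Assumption: "no" is the nginx-documented value that disables buffering
    for this response; Content-Type and Cache-Control are typical for SSE
    and are not taken from the commit.
    """
    return {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        "X-Accel-Buffering": "no",
    }


def format_sse_event(payload: Dict[str, Any]) -> str:
    """Serialize one payload as a server-sent event (data line + blank line)."""
    return f"data:{json.dumps(payload)}\n\n"
```

With the header present, a proxy such as nginx is expected to forward each event as it is written instead of aggregating the whole stream into one response.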
OlivierDehaene	bd3a9d8e85	fix(router): add timeout on flume sends (#488 )	2023-06-23 14:58:28 +02:00
OlivierDehaene	f59fb8b630	feat(router): add ngrok integration (#453 )	2023-06-16 16:25:11 +02:00
OlivierDehaene	19c41824cb	chore: update openapi schema	2023-06-05 18:16:08 +02:00
OlivierDehaene	895c5f1562	feat(server): only compute prefill logprobs when asked (#406 ) Close #288	2023-06-02 17:12:30 +02:00
OlivierDehaene	218c9adaa5	feat: decrease IPC proto size (#367 ) Closes #307 #308	2023-05-24 19:19:57 +02:00
OlivierDehaene	942005386a	feat(router): log input/output at debug level (#364 ) @njhill FYI	2023-05-23 20:47:37 +02:00
OlivierDehaene	5a58226130	fix(server): fix decode token (#334 ) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-05-16 23:23:27 +02:00
OlivierDehaene	68e9d6ab33	feat(server): shard token decode (#303 )	2023-05-10 15:48:21 +02:00
OlivierDehaene	e250282213	feat(docker): add benchmarking tool to docker image (#298 )	2023-05-09 13:19:31 +02:00
Sai Vinay G	926fd9a010	feat(router): Adding response schema for compat_generate (#292 )	2023-05-09 12:38:09 +02:00
Nicolas Patry	b4fe248b17	fix(launcher): handle hub branches (#278 )	2023-05-04 15:14:28 +02:00
Nicolas Patry	411b0d4e1f	chore(github): add templates (#264 )	2023-05-02 15:43:19 +02:00
Nicolas Patry	e86cca9723	Adding docs on how dynamic batching works. (#258 ) This PR starts the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs, the interactions with past key values and ignores the padding problem. Maybe some drawings could help too but I kept it to text for now.	2023-05-01 14:16:50 +02:00
Ehsan M. Kermani	f092ba9b22	feat(server): add watermarking tests (#248 )	2023-04-27 19:16:35 +02:00
Nicolas Patry	db2b4e0754	feat(router): new healthcheck that skips the queue (#244 ) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2023-04-26 20:23:54 +02:00
Nicolas Patry	c4fb09f2ae	feat(router): add tests to validation (#237 )	2023-04-26 16:14:40 +02:00
Nicolas Patry	45344244cf	Starting some routing tests. (#233 )	2023-04-25 14:13:14 +02:00
OlivierDehaene	8b182eb986	feat(router): add endpoint info to /info route (#228 )	2023-04-25 13:11:18 +02:00
OlivierDehaene	ebc74d5666	feat(router): use number of tokens in batch as input for dynamic batching (#226 ) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2023-04-24 17:59:00 +02:00
OlivierDehaene	6ded76a4ae	v0.6.0 (#222 )	2023-04-21 21:00:57 +02:00
OlivierDehaene	343437c7b5	feat(router): add device and dtype info (#215 )	2023-04-21 15:36:29 +02:00
OlivierDehaene	709d8936f6	feat(router): drop requests when client closes the channel (#202 )	2023-04-20 11:07:40 +02:00
OlivierDehaene	b6ee0ec7b0	feat(router): add git sha to info route (#208 )	2023-04-19 21:36:59 +02:00
OlivierDehaene	252f42c1e6	fix(router): add auth token to get model info (#207 )	2023-04-19 20:06:06 +02:00
OlivierDehaene	2475aede61	feat(router): add info route (#196 ) close #125	2023-04-18 16:16:06 +02:00
OlivierDehaene	c13b9d87c9	fix(router): fix truncation (#190 ) closes #189	2023-04-17 16:51:53 +02:00
OlivierDehaene	64347b05ff	fix(ci): fix CVE in github-slug-action (#174 )	2023-04-13 12:43:05 +02:00
OlivierDehaene	6f0f1d70f6	v0.5.0 (#168 )	2023-04-11 20:32:18 +02:00
OlivierDehaene	9987960062	feat(router): make router input validation optional (#164 )	2023-04-09 20:22:27 +02:00
OlivierDehaene	7dec65a244	fix(router): use buckets for metrics histograms (#163 )	2023-04-09 20:13:28 +02:00
OlivierDehaene	5cddc055e6	fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162 )	2023-04-09 20:07:02 +02:00
OlivierDehaene	fef1a1c381	v0.4.3 (#152 )	2023-03-30 17:28:14 +02:00
OlivierDehaene	84722f3e33	v0.4.2 (#151 )	2023-03-30 17:10:01 +02:00
OlivierDehaene	610bb1f978	feat(benchmark): tui based benchmarking tool (#149 )	2023-03-30 15:26:27 +02:00
OlivierDehaene	d503e8f09d	feat: aws sagemaker compatible image (#147 ) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>	2023-03-29 21:38:30 +02:00
OlivierDehaene	f000068944	feat(server): clear cache on error (#143 )	2023-03-28 11:29:35 +02:00
OlivierDehaene	ab5fd8cf93	v0.4.1 (#140 )	2023-03-26 16:37:51 +02:00
OlivierDehaene	b49dbf2d88	fix(server): use server tokenizer as gt (#128 )	2023-03-16 12:12:26 +01:00
OlivierDehaene	cbd36aa4d1	fix(server): revert gpt-neox optims (#123 )	2023-03-13 22:57:08 +01:00
OlivierDehaene	411d6247f4	v0.4.0 (#119 )	2023-03-09 16:07:01 +01:00
OlivierDehaene	55bd4fed7d	feat(router): add best_of parameter (#117 )	2023-03-09 15:30:54 +01:00
OlivierDehaene	e8bfe199ba	feat(router): support left truncation (#115 ) closes #111	2023-03-09 13:10:30 +01:00
OlivierDehaene	1a2d68250a	feat: support typical sampling (#114 ) closes #112	2023-03-09 11:33:57 +01:00
OlivierDehaene	3fef90d50f	feat(clients): Python client (#103 )	2023-03-07 18:52:22 +01:00
OlivierDehaene	cd5961b5da	feat: allow local models (#101 ) closes #99	2023-03-06 14:39:36 +01:00
OlivierDehaene	1c19b0934e	v0.3.2 (#97 )	2023-03-03 18:42:20 +01:00
OlivierDehaene	9b8ea6a6c7	feat(server): add logits watermark (#90 )	2023-03-02 12:30:41 +01:00
OlivierDehaene	f874c47831	feat(router): add api-inference headers (#91 )	2023-03-02 11:41:51 +01:00
OlivierDehaene	4e685d907e	feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89 )	2023-02-28 10:19:32 +01:00
OlivierDehaene	21340f24ba	feat(router): add legacy route for api-inference support (#88 )	2023-02-27 14:56:58 +01:00
OlivierDehaene	0ac184ce77	feat(server): add special token bool (#85 )	2023-02-24 15:55:57 +01:00
OlivierDehaene	4b1c9720c0	v0.3.1 (#84 )	2023-02-24 13:27:41 +01:00
OlivierDehaene	6796d38c6d	feat(router): add cors allow origin options (#73 )	2023-02-17 18:22:00 +01:00
OlivierDehaene	c720555adc	v0.3.0 (#72 )	2023-02-16 17:28:29 +01:00
OlivierDehaene	439fcaf810	feat(router): add prometheus metrics scrape endpoint (#71 )	2023-02-16 17:18:53 +01:00
OlivierDehaene	5437d49beb	feat(router): add max_total_tokens and empty_input validation (#68 ) closes #65	2023-02-15 21:56:59 +01:00
OlivierDehaene	9af454142a	feat: add distributed tracing (#62 )	2023-02-13 13:02:45 +01:00
Yannic Kilcher	e520d5b349	fixed SSE naming (#61 ) https://en.wikipedia.org/wiki/Server-sent_events	2023-02-08 22:30:11 +01:00
OlivierDehaene	2fe5e1b30e	V0.2.1 (#58 )	2023-02-07 15:40:25 +01:00
OlivierDehaene	20c3c5940c	feat(router): refactor API and add openAPI schemas (#53 )	2023-02-03 12:43:37 +01:00
OlivierDehaene	b1482d9048	breaking(router): modify /generate API to only return generated text (#50 ) @njhill, @yk FYI generated_text was concatenated to the user prompt for legacy reasons. We want to remove this behaviour as we don't think it is useful and even detrimental to usability. We also remove the unused Vec.	2023-02-02 15:02:04 +01:00
OlivierDehaene	7b870e1e18	feat(router): use background task to manage request queue (#52 ) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2023-02-02 14:59:27 +01:00
OlivierDehaene	313194f6d7	feat(server): support repetition penalty (#47 )	2023-02-01 15:58:42 +01:00
OlivierDehaene	017a2a8c2f	feat: Add token streaming using ServerSideEvents support (#41 )	2023-01-31 17:04:00 +01:00
OlivierDehaene	54fec93193	fix(server): fix seeding with multiple shards (#44 )	2023-01-31 16:01:15 +01:00
OlivierDehaene	4f9ac67cfa	Revert "feat: Add token streaming using ServerSideEvents support" (#40 ) Reverts huggingface/text-generation-inference#36	2023-01-31 14:21:51 +01:00
OlivierDehaene	7fbfbb0dc5	feat: Add token streaming using ServerSideEvents support (#36 ) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ```	2023-01-31 11:49:43 +01:00
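A client-side view of the event signature in #36 can be sketched in Python. The dataclasses mirror the Rust structs above; the wire format (one JSON object per `data:` line) and the helper name `parse_sse_data` are assumptions, and `Token` is kept as a plain dict because its fields are not listed in the commit message.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Details:
    finish_reason: str
    generated_tokens: int
    seed: Optional[int] = None


@dataclass
class StreamResponse:
    token: Dict[str, Any]  # Token fields are not given in the commit message
    generated_text: Optional[str] = None
    details: Optional[Details] = None


def parse_sse_data(line: str) -> Optional[StreamResponse]:
    """Parse one `data:` line of the stream; return None for any other line."""
    if not line.startswith("data:"):
        return None
    payload = json.loads(line[len("data:"):].strip())
    raw_details = payload.get("details")
    details = None
    if raw_details is not None:
        details = Details(
            finish_reason=raw_details["finish_reason"],
            generated_tokens=raw_details["generated_tokens"],
            seed=raw_details.get("seed"),
        )
    return StreamResponse(
        token=payload["token"],
        generated_text=payload.get("generated_text"),
        details=details,
    )
```

In this reading, `generated_text` and `details` stay `None` on intermediate events and are only filled on the final one, which is what the `Option` types in the Rust signature suggest.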
OlivierDehaene	cd298bc5e5	feat: Support sampling seeding (#37 ) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>	2023-01-30 15:36:16 +01:00
OlivierDehaene	1539d3cbbe	feat(router): Remove second lock from batcher hot path (#27 ) @njhill	2023-01-26 16:29:13 +01:00
OlivierDehaene	5c01e2544c	fix(router): fix api-inference deployment (#31 )	2023-01-23 17:42:14 +01:00
OlivierDehaene	f9d0ec376a	feat(docker): Make the image compatible with api-inference (#29 )	2023-01-23 17:11:27 +01:00
OlivierDehaene	15511edc01	feat(server): Support SantaCoder (#26 )	2023-01-20 12:24:39 +01:00
Nick Hill	f7ac394935	fix(router): Obey max batch size (#23 )	2023-01-17 09:11:21 +01:00
Nick Hill	e6d3eb5d5d	fix(server): Minor refactorization using new_zeros (#24 ) - Fix some type hints, in particular base tokenizer class - Make use of `tensor.new_zeros`/`tensor.new_empty` methods - Simplify env var string parsing in launcher	2023-01-17 09:10:22 +01:00
Nick Hill	60472f9d2b	feat(router): Add const parameters to validation logic (#15 ) I noticed some opportunity to collapse some of the logic, in case you are interested.	2023-01-03 10:41:22 +01:00
Nick Hill	3efa5bbbfd	fix(router): Include special tokens when tokenizing (#14 ) There's currently a discrepancy in the tokenization between the router and python server code. The latter includes special tokens but the former does not. This results in a token count mismatch for seq2seq models such as mt0 where the tokenizer emits an EOS token at the end. This in turn results in some unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row. As far as I can tell, it is better to include this token in the encoder `input_ids`, so I guess it's best to just adjust on the router side.	2022-12-30 19:31:44 +01:00
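The mismatch described in #14 can be reproduced with a toy example. This Python sketch uses a made-up whitespace tokenizer and a made-up EOS marker purely to show the off-by-one; it does not use the real tokenizer of mt0 or any other model.

```python
from typing import List

EOS = "</s>"  # stand-in end-of-sequence marker for the toy tokenizer


def toy_encode(text: str, add_special_tokens: bool) -> List[str]:
    """Whitespace tokenizer that optionally appends an EOS token."""
    tokens = text.split()
    if add_special_tokens:
        tokens.append(EOS)
    return tokens


prompt = "translate this sentence"

# Router side before the fix: special tokens are left out.
router_len = len(toy_encode(prompt, add_special_tokens=False))

# Python server side: special tokens are included.
server_len = len(toy_encode(prompt, add_special_tokens=True))

# The server sees one more token than the length the router reported.
mismatch = server_len - router_len
```

Because the server uses the router-provided length for each row, a count that is one short can misalign rows when batches are concatenated, which is why the commit aligns the router with the server.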
OlivierDehaene	32a253063d	feat: Return logprobs (#8 )	2022-12-15 17:03:56 +01:00
OlivierDehaene	718096f695	feat: Support stop sequences (#7 )	2022-12-12 18:25:22 +01:00
OlivierDehaene	a2985036aa	feat(server): Add model tests (#6 )	2022-12-08 18:49:33 +01:00
Nick Hill	31d76e238d	fix(batching): Avoid theoretical hang in batcher loop (#5 ) - Avoid theoretical hang in batcher loop - Avoid a couple of clones in the router generate method - Keep attention mask tensors as integers - Remove num_heads attribute Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>	2022-12-05 10:10:59 +01:00

270 Commits