hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
drbh	3ccb3bb0b5	feat: support raise_exception, bos and eos tokens (#1450 ) This PR adds support to handle the custom jinja function `raise_exception` and passes the `bos` and `eos` tokens into the template Additionally this PR adds 3 tests to validate and show examples of what can and cannot be parsed currently. ```bash cargo test --package text-generation-router --lib -- infer::tests --nocapture # Finished test [unoptimized + debuginfo] target(s) in 7.82s # Running unittests src/lib.rs (target/debug/deps/text_generation_router-18a0bbf99c2ca1b4) # running 3 tests # test infer::tests::test_chat_template_valid_with_raise ... ok # test infer::tests::test_chat_template ... ok # test infer::tests::test_chat_template_invalid_with_raise ... ok # test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 15 filtered out; finished in 0.00s ```	2024-01-18 12:31:56 +01:00
drbh	0eabc83541	feat: supports openai chat completions API (#1427 ) This PR adds support to make TGI a drop in replacement for OpenAI clients by exposing the same HTTP interface. Notes - TGI inits a single model at startup so the `model` field is unused in HTTP requests. - `max_tokens` and `stream` should work as expected but other params may be (unimplemented or not supported) General approach - fetch the `tokenizer_config` at startup from the hub - pass `tokenizer_config` into `Infer` so we have it at request time - use the `chat_template` on the config to format chat request - parse jinja template and render chat string - pass inputs into existing generate function - wrap generation output in expected structure before returning # How to test ### Streaming curl ```bash curl localhost:3000/v1/chat/completions \ -X POST \ -d '{ "model": "tgi", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is deep learning?" } ], "stream": true, "max_tokens": 20 }' \ -H 'Content-Type: application/json' ``` It is also possible to use the `openai` python library and change the base url ### 🌊 STREAMING REQUEST ```python from openai import OpenAI # init the client but point it to TGI client = OpenAI( base_url="http://localhost:3000/v1", api_key="not needed for a local LLM" ) chat_completion = client.chat.completions.create( model="tgi", messages=[ {"role": "system", "content": "You are a helpful assistant." 
}, {"role": "user", "content": "What is deep learning?"} ], stream=True ) # iterate and print stream for message in chat_completion: print(message) # ChatCompletionChunk(id='', choices=[Choice(delta=ChoiceDelta(content=' that', function_call=None, role='assistant', tool_calls=None), finish_reason=None, index=2, logprobs=None)], created=1704486761, model='', object='text_completion', system_fingerprint='') ``` ### 🚗 SYNCHRONOUS REQUEST ```python from openai import OpenAI # init the client but point it to TGI client = OpenAI( base_url="http://localhost:3000/v1", api_key="not needed for a local LLM" ) chat_completion = client.chat.completions.create( model="tgi", messages=[ {"role": "system", "content": "You are a helpful assistant." }, {"role": "user", "content": "What is deep learning?"} ], stream=False ) print(chat_completion) # ChatCompletion(id='', choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='\nDeep learning is a new field of research that has been gaining traction in the last ...', role='assistant', function_call=None, tool_calls=None))], created=1704486762, model='', object='text_completion', system_fingerprint='', usage=CompletionUsage(completion_tokens=100, prompt_tokens=76, total_tokens=176)) ``` ## How to run dev ```bash cd text-generation-inference/server MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 text-generation-server serve --trust-remote-code gpt2 ``` ***note many of the existing `chat_templates` use non standard `jinja` (ie. adding a `raise` to the template) which will throw an error when parsing; hence using `upstage/SOLAR-10.7B-Instruct-v1.0` since it has a valid template ```bash cd text-generation-inference/router cargo run -- --tokenizer-name upstage/SOLAR-10.7B-Instruct-v1.0 ``` trigger ```bash curl localhost:3000/v1/chat/completions \ -X POST \ -d '{ "model": "gpt-3.5-turbo", "messages": [ { "role": "system", "content": "You are a helpful assistant." 
}, { "role": "user", "content": "What is the IP address of the Google DNS servers?" } ], "stream": true, "max_tokens": 20, "logprobs": true }' \ -H 'Content-Type: application/json' ``` ^ supports `stream: true` and `stream: false` requests	2024-01-16 11:07:41 +01:00
Nicolas Patry	ac08b4ef9c	Return prompt vs generated tokens. (#1436 ) # What does this PR do? Fixes #637	2024-01-11 13:01:43 -05:00
OlivierDehaene	fbeb1c4475	fix: follow base model for tokenizer in router (#1424 ) Close #1422	2024-01-10 16:35:54 +01:00
OlivierDehaene	d077150eb7	fix: fix gpt-q with groupsize = -1 (#1358 )	2023-12-18 16:07:05 +01:00
OlivierDehaene	8428ed1011	fix: fix offline (#1341 ) (#1347 ) @oOraph --------- Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>	2023-12-18 10:20:08 +01:00
OlivierDehaene	50b495f3d8	feat: add more latency metrics in forward (#1346 )	2023-12-14 15:59:38 +01:00
OlivierDehaene	28821bfd5d	fix: default max_new_tokens to 100	2023-12-13 09:19:19 +01:00
OlivierDehaene	3a521c92b3	feat: mixtral (#1328 )	2023-12-11 14:43:40 +01:00
Nicolas Patry	9ecfa16b12	Speculative (#1308 )	2023-12-11 12:46:30 +01:00
Nicolas Patry	ed2a3f617e	Exllama v2 (#1211 ) # What does this PR do? See #1165 --------- Co-authored-by: Florian Zimmermeister <flozi00.fz@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-24-153.ec2.internal>	2023-11-25 22:38:38 +01:00
Nicolas Patry	3c02262f29	Reduce race condition on file system for test	2023-11-23 15:42:48 +00:00
OlivierDehaene	3dbc649b11	fix: do not leak inputs on error (#1228 ) Close #1225	2023-11-20 10:33:44 +01:00
OlivierDehaene	f9910d13e2	feat: remove flume (#1184 )	2023-10-23 15:51:12 +02:00
OlivierDehaene	5e28f44a83	#1049 CI (#1178 ) See #1049 --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi <yi.a.wang@intel.com>	2023-10-20 10:28:45 +02:00
OlivierDehaene	20ee71dcf5	fix: force one of max_new_tokens or truncate with slow tokenizer	2023-10-11 10:46:40 +02:00
Nicolas Patry	6df43da0a4	Modify the default for `max_new_tokens`. (#1097 ) # What does this PR do? Now clients which do not specify a max_length will be implying `max_new_tokens = max_total_tokens - input_length`. This is a serious change, but one that seems more in line with what users expect from a standing server. --------- Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2023-10-04 17:38:42 +02:00
OlivierDehaene	3b56d7669b	feat: add mistral model (#1071 )	2023-09-28 09:55:47 +02:00
Nicolas Patry	a049864270	Preping 1.1.0 (#1066 ) # What does this PR do? Upgrade all relevant versions and dependencies.	2023-09-27 10:40:18 +02:00
Nicolas Patry	211b54ac41	Rebased #617 (#868 ) --------- Co-authored-by: Vincent Brouwers <vincent.brouwers@ing.com>	2023-08-28 11:43:47 +02:00
Nicolas Patry	05dd14fdb9	Fix `tokenizers==0.13.4` . (#838 )	2023-08-14 19:26:19 +02:00
ivarflakstad	8bdb16ee9a	Use destructuring in router arguments to avoid '.0' (#798 ) # What does this PR do? This is purely code style - not anything important. Instead of writing `req.0` all over we can use [destructuring](https://doc.rust-lang.org/rust-by-example/flow_control/match/destructuring/destructure_structures.html) to access the contained value that we actually want. (Destructuring in function parameters [here](https://doc.rust-lang.org/reference/items/functions.html#function-parameters)) @OlivierDehaene	2023-08-10 10:52:50 +02:00
OlivierDehaene	afd04dc71e	feat(server): update vllm version (#723 )	2023-07-28 15:36:38 +02:00
OlivierDehaene	73a4d65d26	feat: add cuda memory fraction (#659 ) Close #673	2023-07-24 11:43:58 +02:00
OlivierDehaene	1da642bd0e	feat(server): add local prom and health routes if running w/ ngrok	2023-07-21 16:56:30 +02:00
OlivierDehaene	b66b190403	feat(router): ngrok edge (#642 )	2023-07-19 11:59:58 +02:00
OlivierDehaene	fe80f5360c	feat(server): auto max_batch_total_tokens for flash att models (#630 )	2023-07-19 09:31:25 +02:00
OlivierDehaene	982ce3227b	feat(router): explicit warning if revision is not set (#608 )	2023-07-13 18:59:38 +02:00
OlivierDehaene	b7327205a6	feat(launcher): add arg validation and drop subprocess (#595 )	2023-07-13 14:22:37 +02:00
OlivierDehaene	b4024edd45	feat: better errors for warmup and TP (#575 ) Close #571	2023-07-10 14:47:15 +02:00
OlivierDehaene	6f42942772	feat(router): add argument for hostname in router (#545 ) (#550 ) # What does this PR do? In title. Adds argument `--hostname` in router to support something like `--hostname ::`. Tested with ```commandline cargo run -- --port 8080 --hostname :: curl -I -X GET 'http://[::1]:8080/health' # failed before this commit ``` Trigger CI --------- Co-authored-by: Phil Chen <philchen2000@gmail.com>	2023-07-05 18:28:45 +02:00
OlivierDehaene	e28a809004	v0.9.0 (#525 )	2023-07-01 19:25:41 +02:00
OlivierDehaene	3b0c979efc	feat(router): arg validation (#519 )	2023-06-30 20:07:49 +02:00
OlivierDehaene	e74bd41e0f	feat(server): add paged attention to flash models (#516 ) Closes #478	2023-06-30 19:09:59 +02:00
Robert Kimball	70f485bf9f	feat(router): add header option to disable buffering for the generate_stream response (#498 ) # This PR adds an http header option to disable buffering for the generate_stream endpoint response stream. Problem: If a model is run behind a proxy server such as nginx that has buffering enabled then the response stream from generate_stream gets aggregated into a single response which basically disables streaming. Instead of getting a chunked response where each token is presented over time the response presents everything all at once. Solution: This change adds the `X-Accel-Buffering` http header which disables buffering for the generate_stream response, allowing the response to stream properly.	2023-06-28 11:50:12 +02:00
OlivierDehaene	bd3a9d8e85	fix(router): add timeout on flume sends (#488 )	2023-06-23 14:58:28 +02:00
OlivierDehaene	f59fb8b630	feat(router): add ngrok integration (#453 )	2023-06-16 16:25:11 +02:00
OlivierDehaene	19c41824cb	chore: update openapi schema	2023-06-05 18:16:08 +02:00
OlivierDehaene	895c5f1562	feat(server): only compute prefill logprobs when asked (#406 ) Close #288	2023-06-02 17:12:30 +02:00
OlivierDehaene	218c9adaa5	feat: decrease IPC proto size (#367 ) Closes #307 #308	2023-05-24 19:19:57 +02:00
OlivierDehaene	942005386a	feat(router): log input/ouput at debug level (#364 ) @njhill FYI	2023-05-23 20:47:37 +02:00
OlivierDehaene	5a58226130	fix(server): fix decode token (#334 ) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-05-16 23:23:27 +02:00
OlivierDehaene	68e9d6ab33	feat(server): shard token decode (#303 )	2023-05-10 15:48:21 +02:00
OlivierDehaene	e250282213	feat(docker): add benchmarking tool to docker image (#298 )	2023-05-09 13:19:31 +02:00
Sai Vinay G	926fd9a010	feat(router): Adding response schema for compat_generate (#292 )	2023-05-09 12:38:09 +02:00
Nicolas Patry	b4fe248b17	fix(launcher): handle hub branches (#278 )	2023-05-04 15:14:28 +02:00
Nicolas Patry	411b0d4e1f	chore(github): add templates (#264 )	2023-05-02 15:43:19 +02:00
Nicolas Patry	e86cca9723	Adding docs on how dynamic batching works. (#258 ) This PR starts the minimal possible amount of explanation I could think of. It tries to explain how dynamic batching occurs, the interactions with past key values and ignores the padding problem. Maybe some drawings could help too but I kept it to text for now.	2023-05-01 14:16:50 +02:00
Ehsan M. Kermani	f092ba9b22	feat(server): add watermarking tests (#248 )	2023-04-27 19:16:35 +02:00
Nicolas Patry	db2b4e0754	feat(router): new healthcheck that skips the queue (#244 ) Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2023-04-26 20:23:54 +02:00
Nicolas Patry	c4fb09f2ae	feat(router): add tests to validation (#237 )	2023-04-26 16:14:40 +02:00
Nicolas Patry	45344244cf	Starting some routing tests. (#233 )	2023-04-25 14:13:14 +02:00
OlivierDehaene	8b182eb986	feat(router): add endpoint info to /info route (#228 )	2023-04-25 13:11:18 +02:00
OlivierDehaene	ebc74d5666	feat(router): use number of tokens in batch as input for dynamic batching (#226 ) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2023-04-24 17:59:00 +02:00
OlivierDehaene	6ded76a4ae	v0.6.0 (#222 )	2023-04-21 21:00:57 +02:00
OlivierDehaene	343437c7b5	feat(router): add device and dtype info (#215 )	2023-04-21 15:36:29 +02:00
OlivierDehaene	709d8936f6	feat(router): drop requests when client closes the channel (#202 )	2023-04-20 11:07:40 +02:00
OlivierDehaene	b6ee0ec7b0	feat(router): add git sha to info route (#208 )	2023-04-19 21:36:59 +02:00
OlivierDehaene	252f42c1e6	fix(router): add auth token to get model info (#207 )	2023-04-19 20:06:06 +02:00
OlivierDehaene	2475aede61	feat(router): add info route (#196 ) close #125	2023-04-18 16:16:06 +02:00
OlivierDehaene	c13b9d87c9	fix(router): fix truncation (#190 ) closes #189	2023-04-17 16:51:53 +02:00
OlivierDehaene	64347b05ff	fix(ci): fix CVE in github-slug-action (#174 )	2023-04-13 12:43:05 +02:00
OlivierDehaene	6f0f1d70f6	v0.5.0 (#168 )	2023-04-11 20:32:18 +02:00
OlivierDehaene	9987960062	feat(router): make router input validation optional (#164 )	2023-04-09 20:22:27 +02:00
OlivierDehaene	7dec65a244	fix(router): use buckets for metrics histograms (#163 )	2023-04-09 20:13:28 +02:00
OlivierDehaene	5cddc055e6	fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162 )	2023-04-09 20:07:02 +02:00
OlivierDehaene	fef1a1c381	v0.4.3 (#152 )	2023-03-30 17:28:14 +02:00
OlivierDehaene	84722f3e33	v0.4.2 (#151 )	2023-03-30 17:10:01 +02:00
OlivierDehaene	610bb1f978	feat(benchmark): tui based benchmarking tool (#149 )	2023-03-30 15:26:27 +02:00
OlivierDehaene	d503e8f09d	feat: aws sagemaker compatible image (#147 ) The only difference is that now it pushes to registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:... instead of registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-... --------- Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>	2023-03-29 21:38:30 +02:00
OlivierDehaene	f000068944	feat(server): clear cache on error (#143 )	2023-03-28 11:29:35 +02:00
OlivierDehaene	ab5fd8cf93	v0.4.1 (#140 )	2023-03-26 16:37:51 +02:00
OlivierDehaene	b49dbf2d88	fix(server): use server tokenizer as gt (#128 )	2023-03-16 12:12:26 +01:00
OlivierDehaene	cbd36aa4d1	fix(server): revert gpt-neox optims (#123 )	2023-03-13 22:57:08 +01:00
OlivierDehaene	411d6247f4	v0.4.0 (#119 )	2023-03-09 16:07:01 +01:00
OlivierDehaene	55bd4fed7d	feat(router): add best_of parameter (#117 )	2023-03-09 15:30:54 +01:00
OlivierDehaene	e8bfe199ba	feat(router): support left truncation (#115 ) closes #111	2023-03-09 13:10:30 +01:00
OlivierDehaene	1a2d68250a	feat: support typical sampling (#114 ) closes #112	2023-03-09 11:33:57 +01:00
OlivierDehaene	3fef90d50f	feat(clients): Python client (#103 )	2023-03-07 18:52:22 +01:00
OlivierDehaene	cd5961b5da	feat: allow local models (#101 ) closes #99	2023-03-06 14:39:36 +01:00
OlivierDehaene	1c19b0934e	v0.3.2 (#97 )	2023-03-03 18:42:20 +01:00
OlivierDehaene	9b8ea6a6c7	feat(server): add logits watermark (#90 )	2023-03-02 12:30:41 +01:00
OlivierDehaene	f874c47831	feat(router): add api-inference headers (#91 )	2023-03-02 11:41:51 +01:00
OlivierDehaene	4e685d907e	feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89 )	2023-02-28 10:19:32 +01:00
OlivierDehaene	21340f24ba	feat(router): add legacy route for api-inference support (#88 )	2023-02-27 14:56:58 +01:00
OlivierDehaene	0ac184ce77	feat(server): add special token bool (#85 )	2023-02-24 15:55:57 +01:00
OlivierDehaene	4b1c9720c0	v0.3.1 (#84 )	2023-02-24 13:27:41 +01:00
OlivierDehaene	6796d38c6d	feat(router): add cors allow origin options (#73 )	2023-02-17 18:22:00 +01:00
OlivierDehaene	c720555adc	v0.3.0 (#72 )	2023-02-16 17:28:29 +01:00
OlivierDehaene	439fcaf810	feat(router): add prometheus metrics scrape endpoint (#71 )	2023-02-16 17:18:53 +01:00
OlivierDehaene	5437d49beb	feat(router): add max_total_tokens and empty_input validation (#68 ) closes #65	2023-02-15 21:56:59 +01:00
OlivierDehaene	9af454142a	feat: add distributed tracing (#62 )	2023-02-13 13:02:45 +01:00
Yannic Kilcher	e520d5b349	fixed SSE naming (#61 ) https://en.wikipedia.org/wiki/Server-sent_events	2023-02-08 22:30:11 +01:00
OlivierDehaene	2fe5e1b30e	V0.2.1 (#58 )	2023-02-07 15:40:25 +01:00
OlivierDehaene	20c3c5940c	feat(router): refactor API and add openAPI schemas (#53 )	2023-02-03 12:43:37 +01:00
OlivierDehaene	b1482d9048	breaking(router): modify /generate API to only return generated text (#50 ) @njhill, @yk FYI generated_text was concatenated to the user prompt for legacy reasons. We want to remove this behaviour as we don't think it is useful and it is even detrimental to usability. We also remove the unused Vec.	2023-02-02 15:02:04 +01:00
OlivierDehaene	7b870e1e18	feat(router): use background task to manage request queue (#52 ) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2023-02-02 14:59:27 +01:00
OlivierDehaene	313194f6d7	feat(server): support repetition penalty (#47 )	2023-02-01 15:58:42 +01:00
OlivierDehaene	017a2a8c2f	feat: Add token streaming using ServerSideEvents support (#41 )	2023-01-31 17:04:00 +01:00
OlivierDehaene	54fec93193	fix(server): fix seeding with multiple shards (#44 )	2023-01-31 16:01:15 +01:00
OlivierDehaene	4f9ac67cfa	Revert "feat: Add token streaming using ServerSideEvents support" (#40 ) Reverts huggingface/text-generation-inference#36	2023-01-31 14:21:51 +01:00
OlivierDehaene	7fbfbb0dc5	feat: Add token streaming using ServerSideEvents support (#36 ) Add token streaming using ServerSideEvents (SSE). The signature of the SSE events is: ```rust struct Details { finish_reason: String, generated_tokens: u32, seed: Option<u64>, } struct StreamResponse { token: Token, generated_text: Option<String>, details: Option<Details>, } struct ErrorResponse { error: String, } ```	2023-01-31 11:49:43 +01:00
OlivierDehaene	cd298bc5e5	feat: Support sampling seeding (#37 ) Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>	2023-01-30 15:36:16 +01:00
OlivierDehaene	1539d3cbbe	feat(router): Remove second lock from batcher hot path (#27 ) @njhill	2023-01-26 16:29:13 +01:00
OlivierDehaene	5c01e2544c	fix(router): fix api-inference deployment (#31 )	2023-01-23 17:42:14 +01:00
OlivierDehaene	f9d0ec376a	feat(docker): Make the image compatible with api-inference (#29 )	2023-01-23 17:11:27 +01:00
OlivierDehaene	15511edc01	feat(server): Support SantaCoder (#26 )	2023-01-20 12:24:39 +01:00
Nick Hill	f7ac394935	fix(router): Obey max batch size (#23 )	2023-01-17 09:11:21 +01:00
Nick Hill	e6d3eb5d5d	fix(server): Minor refactorization using new_zeros (#24 ) - Fix some type hints, in particular base tokenizer class - Make use of `tensor.new_zero/empty` methods - Simplify env var string parsing in launcher	2023-01-17 09:10:22 +01:00
Nick Hill	60472f9d2b	feat(router): Add const parameters to validation logic (#15 ) I noticed some opportunity to collapse some of the logic, in case you are interested.	2023-01-03 10:41:22 +01:00
Nick Hill	3efa5bbbfd	fix(router): Include special tokens when tokenizing (#14 ) There's currently a discrepancy in the tokenization between the router and python server code. The latter includes special tokens but the former does not. This results in a token count mismatch for seq2seq models such as mt0 where the tokenizer emits an EOS token at the end. This in turn results in some unexpected/incorrect output, in particular when batch concatenation is involved, because the python code uses the input length passed from the router for each row. As far as I can tell, it is better to include this token in the encoder `input_ids`, so I guess it's best to just adjust on the router side.	2022-12-30 19:31:44 +01:00
OlivierDehaene	32a253063d	feat: Return logprobs (#8 )	2022-12-15 17:03:56 +01:00
OlivierDehaene	718096f695	feat: Support stop sequences (#7 )	2022-12-12 18:25:22 +01:00
OlivierDehaene	a2985036aa	feat(server): Add model tests (#6 )	2022-12-08 18:49:33 +01:00
Nick Hill	31d76e238d	fix(batching): Avoid theoretical hang in batcher loop (#5 ) - Avoid theoretical hang in batcher loop - Avoid a couple of clones in the router generate method - Keep attention mask tensors as integers - Remove num_heads attribute Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>	2022-12-05 10:10:59 +01:00
OlivierDehaene	d6d5b12e03	fix(router): Handle tokenizer errors	2022-11-14 17:15:19 +01:00
OlivierDehaene	91f5f86280	fix(router): Fix HTTP status codes	2022-11-14 14:34:15 +01:00
OlivierDehaene	427d7cc444	feat(server): Support AutoModelForSeq2SeqLM	2022-11-04 18:03:04 +01:00
OlivierDehaene	c5665f5c8b	feat(server): Support generic AutoModelForCausalLM	2022-11-04 14:22:47 +01:00
OlivierDehaene	b3b7ea0d74	feat: Use json formatter by default in docker image	2022-11-02 17:29:56 +01:00
OlivierDehaene	3cf6368c77	feat(server): Support all AutoModelForCausalLM on a best effort basis	2022-10-28 19:24:00 +02:00
OlivierDehaene	09674e6df9	feat(server): Support bitsandbytes	2022-10-27 14:25:29 +02:00
OlivierDehaene	beb552127a	feat(client): Simplify sharded logic	2022-10-22 23:40:05 +02:00
OlivierDehaene	c837893370	feat(router): Add max_waiting_tokens	2022-10-21 16:40:05 +02:00
OlivierDehaene	895a341d06	fix(validation): Fix error messages	2022-10-21 10:59:15 +02:00
Olivier Dehaene	f16f2f5ae1	v0.1.0	2022-10-20 19:14:44 +02:00
Olivier Dehaene	92c1ecd008	feat: Add arguments to CLI	2022-10-17 18:27:33 +02:00
Olivier Dehaene	5e5d8766a2	feat: Improve error handling	2022-10-17 14:59:00 +02:00
Olivier Dehaene	bcb53903b8	feat: Add AML deployment	2022-10-15 20:21:50 +02:00
Olivier Dehaene	bf99afe916	feat: Docker image	2022-10-14 15:56:21 +02:00
Olivier Dehaene	39df4d9975	Use axum	2022-10-11 18:14:39 +02:00
Olivier Dehaene	e86ecbac63	ValidationError was not correctly handled	2022-10-11 16:53:40 +02:00
Olivier Dehaene	4c693e6524	Refactored gRPC interface Added validation logic	2022-10-11 16:50:54 +02:00
Olivier Dehaene	fa9a088467	Add load testing	2022-10-11 10:36:51 +02:00
Olivier Dehaene	295831a481	Init	2022-10-08 12:30:12 +02:00
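
The message of commit 6df43da0a4 (#1097) states that a client which does not specify a length implies `max_new_tokens = max_total_tokens - input_length`. The following is a minimal Python sketch of that rule only, not the router's actual code: the function name `effective_max_new_tokens`, its signature, and the error handling are all hypothetical. The log also lists a later commit, 28821bfd5d, titled "fix: default max_new_tokens to 100", so this sketch reflects the behaviour described in #1097 rather than any later default.

```python
from typing import Optional


def effective_max_new_tokens(
    input_length: int,
    max_total_tokens: int,
    max_new_tokens: Optional[int] = None,
) -> int:
    """Return the generation budget for one request.

    Hypothetical helper: the name, signature, and error handling are
    illustrative assumptions, not taken from the router's code.
    """
    if input_length >= max_total_tokens:
        raise ValueError("input leaves no room for generated tokens")
    if max_new_tokens is None:
        # Rule quoted in the message of commit 6df43da0a4 (#1097).
        return max_total_tokens - input_length
    if input_length + max_new_tokens > max_total_tokens:
        raise ValueError("input_length + max_new_tokens exceeds max_total_tokens")
    return max_new_tokens


# A 76-token prompt against an assumed 2048-token total budget.
print(effective_max_new_tokens(input_length=76, max_total_tokens=2048))
```

With an explicit `max_new_tokens`, the sketch returns that value unchanged as long as it fits within the total budget.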


235 Commits