Commit Graph

208 Commits

Author SHA1 Message Date
OlivierDehaene 982ce3227b
feat(router): explicit warning if revision is not set (#608) 2023-07-13 18:59:38 +02:00
OlivierDehaene b7327205a6
feat(launcher): add arg validation and drop subprocess (#595) 2023-07-13 14:22:37 +02:00
OlivierDehaene b4024edd45
feat: better errors for warmup and TP (#575)
Close #571
2023-07-10 14:47:15 +02:00
OlivierDehaene 6f42942772
feat(router): add argument for hostname in router (#545) (#550)
# What does this PR do?

In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with

```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health'  # failed before this commit
```

Trigger CI

---------

Co-authored-by: Phil Chen <philchen2000@gmail.com>
2023-07-05 18:28:45 +02:00
OlivierDehaene e28a809004
v0.9.0 (#525) 2023-07-01 19:25:41 +02:00
OlivierDehaene 3b0c979efc
feat(router): arg validation (#519) 2023-06-30 20:07:49 +02:00
OlivierDehaene e74bd41e0f
feat(server): add paged attention to flash models (#516)
Closes #478
2023-06-30 19:09:59 +02:00
Robert Kimball 70f485bf9f
feat(router): add header option to disable buffering for the generate_stream response (#498)
# This PR adds an http header option to disable buffering for the
generate_stream endpoint response stream.

Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.

Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
2023-06-28 11:50:12 +02:00
OlivierDehaene bd3a9d8e85
fix(router): add timeout on flume sends (#488) 2023-06-23 14:58:28 +02:00
OlivierDehaene f59fb8b630
feat(router): add ngrok integration (#453) 2023-06-16 16:25:11 +02:00
OlivierDehaene 19c41824cb chore: update openapi schema 2023-06-05 18:16:08 +02:00
OlivierDehaene 895c5f1562
feat(server): only compute prefill logprobs when asked (#406)
Close #288
2023-06-02 17:12:30 +02:00
OlivierDehaene 218c9adaa5
feat: decrease IPC proto size (#367)
Closes #307 #308
2023-05-24 19:19:57 +02:00
OlivierDehaene 942005386a
feat(router): log input/ouput at debug level (#364)
@njhill FYI
2023-05-23 20:47:37 +02:00
OlivierDehaene 5a58226130
fix(server): fix decode token (#334)
Fixes #333

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2023-05-16 23:23:27 +02:00
OlivierDehaene 68e9d6ab33
feat(server): shard token decode (#303) 2023-05-10 15:48:21 +02:00
OlivierDehaene e250282213
feat(docker): add benchmarking tool to docker image (#298) 2023-05-09 13:19:31 +02:00
Sai Vinay G 926fd9a010
feat(router): Adding response schema for compat_generate (#292) 2023-05-09 12:38:09 +02:00
Nicolas Patry b4fe248b17
fix(launcher): handle hub branches (#278) 2023-05-04 15:14:28 +02:00
Nicolas Patry 411b0d4e1f
chore(github): add templates (#264) 2023-05-02 15:43:19 +02:00
Nicolas Patry e86cca9723
Adding docs on how dynamic batching works. (#258)
This PR starts the minimal possible amount of explanation I could think
of. It tries to explain how dynamic batching occurs, the interactions
with past key values and ignores the padding problem.

Maybe some drawings could help too but I kept it to text for now.
2023-05-01 14:16:50 +02:00
Ehsan M. Kermani f092ba9b22
feat(server): add watermarking tests (#248) 2023-04-27 19:16:35 +02:00
Nicolas Patry db2b4e0754
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-04-26 20:23:54 +02:00
Nicolas Patry c4fb09f2ae
feat(router): add tests to validation (#237) 2023-04-26 16:14:40 +02:00
Nicolas Patry 45344244cf
Starting some routing tests. (#233) 2023-04-25 14:13:14 +02:00
OlivierDehaene 8b182eb986
feat(router): add endpoint info to /info route (#228) 2023-04-25 13:11:18 +02:00
OlivierDehaene ebc74d5666
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-04-24 17:59:00 +02:00
OlivierDehaene 6ded76a4ae
v0.6.0 (#222) 2023-04-21 21:00:57 +02:00
OlivierDehaene 343437c7b5
feat(router): add device and dtype info (#215) 2023-04-21 15:36:29 +02:00
OlivierDehaene 709d8936f6
feat(router): drop requests when client closes the channel (#202) 2023-04-20 11:07:40 +02:00
OlivierDehaene b6ee0ec7b0
feat(router): add git sha to info route (#208) 2023-04-19 21:36:59 +02:00
OlivierDehaene 252f42c1e6
fix(router): add auth token to get model info (#207) 2023-04-19 20:06:06 +02:00
OlivierDehaene 2475aede61
feat(router): add info route (#196)
close #125
2023-04-18 16:16:06 +02:00
OlivierDehaene c13b9d87c9
fix(router): fix truncation (#190)
closes #189
2023-04-17 16:51:53 +02:00
OlivierDehaene 64347b05ff
fix(ci): fix CVE in github-slug-action (#174) 2023-04-13 12:43:05 +02:00
OlivierDehaene 6f0f1d70f6
v0.5.0 (#168) 2023-04-11 20:32:18 +02:00
OlivierDehaene 9987960062
feat(router): make router input validation optional (#164) 2023-04-09 20:22:27 +02:00
OlivierDehaene 7dec65a244
fix(router): use buckets for metrics histograms (#163) 2023-04-09 20:13:28 +02:00
OlivierDehaene 5cddc055e6
fix(rust-client): use join_all instead of select_all to hopefully fix nccl issues (#162) 2023-04-09 20:07:02 +02:00
OlivierDehaene fef1a1c381
v0.4.3 (#152) 2023-03-30 17:28:14 +02:00
OlivierDehaene 84722f3e33
v0.4.2 (#151) 2023-03-30 17:10:01 +02:00
OlivierDehaene 610bb1f978
feat(benchmark): tui based benchmarking tool (#149) 2023-03-30 15:26:27 +02:00
OlivierDehaene d503e8f09d
feat: aws sagemaker compatible image (#147)
The only difference is that now it pushes to
registry.internal.huggingface.tech/api-inference/community/text-generation-inference/sagemaker:...
instead of
registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sagemaker-...

---------

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
2023-03-29 21:38:30 +02:00
OlivierDehaene f000068944
feat(server): clear cache on error (#143) 2023-03-28 11:29:35 +02:00
OlivierDehaene ab5fd8cf93
v0.4.1 (#140) 2023-03-26 16:37:51 +02:00
OlivierDehaene b49dbf2d88
fix(server): use server tokenizer as gt (#128) 2023-03-16 12:12:26 +01:00
OlivierDehaene cbd36aa4d1
fix(server): revert gpt-neox optims (#123) 2023-03-13 22:57:08 +01:00
OlivierDehaene 411d6247f4
v0.4.0 (#119) 2023-03-09 16:07:01 +01:00
OlivierDehaene 55bd4fed7d
feat(router): add best_of parameter (#117) 2023-03-09 15:30:54 +01:00
OlivierDehaene e8bfe199ba
feat(router): support left truncation (#115)
closes #111
2023-03-09 13:10:30 +01:00
OlivierDehaene 1a2d68250a
feat: support typical sampling (#114)
closes #112
2023-03-09 11:33:57 +01:00
OlivierDehaene 3fef90d50f
feat(clients): Python client (#103) 2023-03-07 18:52:22 +01:00
OlivierDehaene cd5961b5da
feat: allow local models (#101)
closes #99
2023-03-06 14:39:36 +01:00
OlivierDehaene 1c19b0934e
v0.3.2 (#97) 2023-03-03 18:42:20 +01:00
OlivierDehaene 9b8ea6a6c7
feat(server): add logits watermark (#90) 2023-03-02 12:30:41 +01:00
OlivierDehaene f874c47831
feat(router): add api-inference headers (#91) 2023-03-02 11:41:51 +01:00
OlivierDehaene 4e685d907e
feat(router): ask hf.co for pipelinetag to decide on compat_return_full_text (#89) 2023-02-28 10:19:32 +01:00
OlivierDehaene 21340f24ba
feat(router): add legacy route for api-inference support (#88) 2023-02-27 14:56:58 +01:00
OlivierDehaene 0ac184ce77
feat(server): add special token bool (#85) 2023-02-24 15:55:57 +01:00
OlivierDehaene 4b1c9720c0
v0.3.1 (#84) 2023-02-24 13:27:41 +01:00
OlivierDehaene 6796d38c6d
feat(router): add cors allow origin options (#73) 2023-02-17 18:22:00 +01:00
OlivierDehaene c720555adc
v0.3.0 (#72) 2023-02-16 17:28:29 +01:00
OlivierDehaene 439fcaf810
feat(router): add prometheus metrics scrape endpoint (#71) 2023-02-16 17:18:53 +01:00
OlivierDehaene 5437d49beb
feat(router): add max_total_tokens and empty_input validation (#68)
closes #65
2023-02-15 21:56:59 +01:00
OlivierDehaene 9af454142a
feat: add distributed tracing (#62) 2023-02-13 13:02:45 +01:00
Yannic Kilcher e520d5b349
fixed SSE naming (#61)
https://en.wikipedia.org/wiki/Server-sent_events
2023-02-08 22:30:11 +01:00
OlivierDehaene 2fe5e1b30e
V0.2.1 (#58) 2023-02-07 15:40:25 +01:00
OlivierDehaene 20c3c5940c
feat(router): refactor API and add openAPI schemas (#53) 2023-02-03 12:43:37 +01:00
OlivierDehaene b1482d9048
breaking(router): modify /generate API to only return generated text (#50)
@njhill, @yk FYI

generated_text was concatenated to the user prompt for legacy reason. We
want to remove this behaviour as we don't think it is useful and even
detrimonial to usability.

We also remove the unused Vec.
2023-02-02 15:02:04 +01:00
OlivierDehaene 7b870e1e18
feat(router): use background task to manage request queue (#52)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-02-02 14:59:27 +01:00
OlivierDehaene 313194f6d7
feat(server): support repetition penalty (#47) 2023-02-01 15:58:42 +01:00
OlivierDehaene 017a2a8c2f
feat: Add token streaming using ServerSideEvents support (#41) 2023-01-31 17:04:00 +01:00
OlivierDehaene 54fec93193
fix(server): fix seeding with multiple shards (#44) 2023-01-31 16:01:15 +01:00
OlivierDehaene 4f9ac67cfa
Revert "feat: Add token streaming using ServerSideEvents support" (#40)
Reverts huggingface/text-generation-inference#36
2023-01-31 14:21:51 +01:00
OlivierDehaene 7fbfbb0dc5
feat: Add token streaming using ServerSideEvents support (#36)
Add token streaming using ServerSideEvents (SSE).

The signature of the SSE events is: 

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {
    token: Token,
    generated_text: Option<String>,
    details: Option<Details>,
}

struct ErrorResponse {
    error: String,
}
```
2023-01-31 11:49:43 +01:00
OlivierDehaene cd298bc5e5
feat: Support sampling seeding (#37)
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
2023-01-30 15:36:16 +01:00
OlivierDehaene 1539d3cbbe
feat(router): Remove second lock from batcher hot path (#27)
@njhill
2023-01-26 16:29:13 +01:00
OlivierDehaene 5c01e2544c
fix(router): fix api-inference deployment (#31) 2023-01-23 17:42:14 +01:00
OlivierDehaene f9d0ec376a
feat(docker): Make the image compatible with api-inference (#29) 2023-01-23 17:11:27 +01:00
OlivierDehaene 15511edc01
feat(server): Support SantaCoder (#26) 2023-01-20 12:24:39 +01:00
Nick Hill f7ac394935
fix(router): Obey max batch size (#23) 2023-01-17 09:11:21 +01:00
Nick Hill e6d3eb5d5d
fix(server): Minor refactorization using new_zeros (#24)
- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher
2023-01-17 09:10:22 +01:00
Nick Hill 60472f9d2b
feat(router): Add const parameters to validation logic (#15)
I noticed some opportunity to collapse some of the logic, in case you
are interested.
2023-01-03 10:41:22 +01:00
Nick Hill 3efa5bbbfd
fix(router): Include special tokens when tokenizing (#14)
There's currently a discrepancy in the tokenization between the router
and python server code. The latter includes special tokens but former
does not.

This results in a token count mismatch for seq2seq models such as mt0
where the tokenizer emits an EOS token at the end.

This in turn results in some unexpected/incorrect output, in particular
when batch concatenation is involved, because the python code uses the
input length passed from the router for each row.

As far as I can tell, it is better to include this token in the encoder
`input_ids`, so I guess it's best to just adjust on the router side.
2022-12-30 19:31:44 +01:00
OlivierDehaene 32a253063d
feat: Return logprobs (#8) 2022-12-15 17:03:56 +01:00
OlivierDehaene 718096f695
feat: Support stop sequences (#7) 2022-12-12 18:25:22 +01:00
OlivierDehaene a2985036aa
feat(server): Add model tests (#6) 2022-12-08 18:49:33 +01:00
Nick Hill 31d76e238d
fix(batching): Avoid theoretical hang in batcher loop (#5)
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute

Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>
2022-12-05 10:10:59 +01:00
OlivierDehaene d6d5b12e03 fix(router): Handle tokenizer errors 2022-11-14 17:15:19 +01:00
OlivierDehaene 91f5f86280 fix(router): Fix HTTP status codes 2022-11-14 14:34:15 +01:00
OlivierDehaene 427d7cc444 feat(server): Support AutoModelForSeq2SeqLM 2022-11-04 18:03:04 +01:00
OlivierDehaene c5665f5c8b feat(server): Support generic AutoModelForCausalLM 2022-11-04 14:22:47 +01:00
OlivierDehaene b3b7ea0d74 feat: Use json formatter by default in docker image 2022-11-02 17:29:56 +01:00
OlivierDehaene 3cf6368c77 feat(server): Support all AutoModelForCausalLM on a best effort basis 2022-10-28 19:24:00 +02:00
OlivierDehaene 09674e6df9 feat(server): Support bitsandbytes 2022-10-27 14:25:29 +02:00
OlivierDehaene beb552127a feat(client): Simplify sharded logic 2022-10-22 23:40:05 +02:00
OlivierDehaene c837893370 feat(router): Add max_waiting_tokens 2022-10-21 16:40:05 +02:00
OlivierDehaene 895a341d06 fix(validation): Fix error messages 2022-10-21 10:59:15 +02:00
Olivier Dehaene f16f2f5ae1 v0.1.0 2022-10-20 19:14:44 +02:00
Olivier Dehaene 92c1ecd008 feat: Add arguments to CLI 2022-10-17 18:27:33 +02:00
Olivier Dehaene 5e5d8766a2 feat: Improve error handling 2022-10-17 14:59:00 +02:00
Olivier Dehaene bcb53903b8 feat: Add AML deployment 2022-10-15 20:21:50 +02:00
Olivier Dehaene bf99afe916 feat: Docker image 2022-10-14 15:56:21 +02:00
Olivier Dehaene 39df4d9975 Use axum 2022-10-11 18:14:39 +02:00
Olivier Dehaene e86ecbac63 ValidationError was not correctly handled 2022-10-11 16:53:40 +02:00
Olivier Dehaene 4c693e6524 Refactored gRPC interface
Added validation logic
2022-10-11 16:50:54 +02:00
Olivier Dehaene fa9a088467 Add load testing 2022-10-11 10:36:51 +02:00
Olivier Dehaene 295831a481 Init 2022-10-08 12:30:12 +02:00