OlivierDehaene
745f596c88
feat(server): use float16 ( #304 )
2023-05-10 15:51:10 +02:00
OlivierDehaene
68e9d6ab33
feat(server): shard token decode ( #303 )
2023-05-10 15:48:21 +02:00
OlivierDehaene
ad66f6ef9a
feat(server): optim flash causal lm decode_token ( #285 )
2023-05-09 18:26:19 +02:00
Nicolas Patry
b4aa87db58
fea(server): decrease convert RAM requirements ( #286 )
2023-05-05 17:57:02 +02:00
Nicolas Patry
690fc31757
fix(server): fix convert ( #284 )
2023-05-05 15:28:08 +02:00
Nicolas Patry
f08343d44d
fix(server): Removes the parallelism in file convertion (during download) ( #275 )
2023-05-04 15:22:54 +02:00
OlivierDehaene
85aa7e2e7b
feat(server): support hf endpoint weight layout ( #266 )
2023-05-03 11:36:24 +02:00
OlivierDehaene
4096000e34
fix(server): fix typo in tokenizers decode ( #269 )
...
closes #268
2023-05-03 10:10:34 +02:00
Ehsan M. Kermani
f092ba9b22
feat(server): add watermarking tests ( #248 )
2023-04-27 19:16:35 +02:00
OlivierDehaene
b9ae7e5da1
chore(server): update transformers ( #250 )
2023-04-27 09:57:41 +02:00
Nick Hill
34bca0b8d3
fix(server): Small tidy of code from recent changes ( #251 )
...
remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
2023-04-27 09:57:28 +02:00
Nick Hill
b4cf832c40
fix(server): fix reshaping of bloom past_key_values in concatenate() ( #252 )
...
Introduced in #214
Fixes #249
2023-04-27 09:51:27 +02:00
Nicolas Patry
db2b4e0754
feat(router): new healthcheck that skips the queue ( #244 )
...
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-04-26 20:23:54 +02:00
OlivierDehaene
37b64a5c10
chore(server): update safetensors version ( #235 )
2023-04-25 13:50:56 +02:00
OlivierDehaene
ebc74d5666
feat(router): use number of tokens in batch as input for dynamic batching ( #226 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-04-24 17:59:00 +02:00
OlivierDehaene
98a3e0d135
chore(server): update huggingface-hub ( #227 )
2023-04-24 15:57:13 +02:00
Nick Hill
4a7dd4085a
feat(server): reduce memory requirement ( #214 )
2023-04-24 14:15:42 +02:00
OlivierDehaene
6ded76a4ae
v0.6.0 ( #222 )
2023-04-21 21:00:57 +02:00
OlivierDehaene
4b460e72fb
fix(server): fix flash batch filtering ( #220 )
2023-04-21 20:26:01 +02:00
OlivierDehaene
1ffea36ec2
fix(server): fix flash causal ( #219 )
2023-04-21 19:49:08 +02:00
OlivierDehaene
86bca365df
fix(server): fix flash causal ( #218 )
2023-04-21 19:42:16 +02:00
OlivierDehaene
afc5b999d0
fix(server): cleanup new flash past_key_values logic ( #217 )
2023-04-21 16:19:04 +02:00
OlivierDehaene
db4cb5e4ed
fix(server): fix past key values logic ( #216 )
...
@njhill fyi
2023-04-21 15:59:18 +02:00
OlivierDehaene
343437c7b5
feat(router): add device and dtype info ( #215 )
2023-04-21 15:36:29 +02:00
Nick Hill
ac8c0f6fe4
feat(server): flash attention past key value optimizations ( #213 )
2023-04-21 14:57:18 +02:00
OlivierDehaene
709d8936f6
feat(router): drop requests when client closes the channel ( #202 )
2023-04-20 11:07:40 +02:00
OlivierDehaene
b6ee0ec7b0
feat(router): add git sha to info route ( #208 )
2023-04-19 21:36:59 +02:00
OlivierDehaene
6837b2eb77
fix(docker): remove unused dependencies ( #205 )
2023-04-19 19:39:31 +02:00
OlivierDehaene
5d27f5259b
fix(server): fix hf_transfer issue with private repos ( #203 )
2023-04-19 17:36:16 +02:00
OlivierDehaene
a88c54bb4c
feat(server): check cuda capability when importing flash models ( #201 )
...
close #198
2023-04-19 12:52:37 +02:00
OlivierDehaene
e14ae3b5e9
feat(server): support quantization for flash models ( #200 )
...
closes #197
2023-04-19 12:51:11 +02:00
OlivierDehaene
7a1ba58557
fix(docker): fix docker image dependencies ( #187 )
2023-04-17 00:26:47 +02:00
OlivierDehaene
53ee09c0b0
fea(dockerfile): better layer caching ( #159 )
2023-04-14 10:12:21 +02:00
OlivierDehaene
64347b05ff
fix(ci): fix CVE in github-slug-action ( #174 )
2023-04-13 12:43:05 +02:00
OlivierDehaene
880a76eed5
feat(server): support sharded santacoder ( #167 )
2023-04-12 17:18:08 +02:00
OlivierDehaene
5fa8ae041c
feat(server): optimize decode for sane tokenizers ( #170 )
2023-04-12 12:03:10 +02:00
OlivierDehaene
6f0f1d70f6
v0.5.0 ( #168 )
2023-04-11 20:32:18 +02:00
OlivierDehaene
f26dfd0dc1
feat(server): support OPT models ( #55 )
...
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
2023-04-11 19:16:41 +02:00
OlivierDehaene
299217c95c
feat(server): add flash attention llama ( #144 )
2023-04-11 16:38:22 +02:00
OlivierDehaene
9987960062
feat(router): make router input validation optional ( #164 )
2023-04-09 20:22:27 +02:00
OlivierDehaene
1883d8ecde
feat(docker): improve flash_attention caching ( #160 )
2023-04-09 19:59:16 +02:00
OlivierDehaene
3f2542bb6a
fix(server): fix escape characters in stop sequence ( #155 )
2023-04-05 19:37:41 +02:00
OlivierDehaene
c0aeb32583
feat(server): flash santacoder ( #153 )
2023-04-03 19:06:42 +02:00
OlivierDehaene
fef1a1c381
v0.4.3 ( #152 )
2023-03-30 17:28:14 +02:00
OlivierDehaene
84722f3e33
v0.4.2 ( #151 )
2023-03-30 17:10:01 +02:00
OlivierDehaene
08b7e4a282
fix(server): fix flash neox rotary embeddings ( #150 )
2023-03-30 16:12:23 +02:00
OlivierDehaene
610bb1f978
feat(benchmark): tui based benchmarking tool ( #149 )
2023-03-30 15:26:27 +02:00
OlivierDehaene
c9bdaa8b73
feat(server): reduce mlp and attn in one op for flash neox ( #145 )
2023-03-28 16:51:41 +02:00
OlivierDehaene
f000068944
feat(server): clear cache on error ( #143 )
2023-03-28 11:29:35 +02:00
Nick Hill
8e8dd984d8
feat(server): Add mypy-protobuf ( #141 )
...
Generates .pyi files for protobuf stubs which provide strong typing
information. Very helpful for IDE auto-completion, etc.
2023-03-27 09:25:15 +02:00
Nick Hill
462530c2b0
fix(server): Avoid using try/except to determine kind of AutoModel ( #142 )
2023-03-27 09:23:22 +02:00
OlivierDehaene
ab5fd8cf93
v0.4.1 ( #140 )
2023-03-26 16:37:51 +02:00
OlivierDehaene
678b2f3900
feat(server): cleanup flash neox loading ( #139 )
2023-03-26 16:37:21 +02:00
OlivierDehaene
d6a93fe992
fix(server): fix flash-neox scores warping ( #137 )
2023-03-24 18:21:41 +01:00
OlivierDehaene
05e9a796cc
feat(server): flash neoX ( #133 )
2023-03-24 14:02:14 +01:00
OlivierDehaene
b49dbf2d88
fix(server): use server tokenizer as gt ( #128 )
2023-03-16 12:12:26 +01:00
OlivierDehaene
8ad60b752f
fix(server): add position ids to neox ( #126 )
2023-03-15 13:12:49 +01:00
OlivierDehaene
cbd36aa4d1
fix(server): revert gpt-neox optims ( #123 )
2023-03-13 22:57:08 +01:00
OlivierDehaene
411d6247f4
v0.4.0 ( #119 )
2023-03-09 16:07:01 +01:00
OlivierDehaene
c0795de2f2
fix(server): do not warp prefill logits ( #116 )
2023-03-09 13:00:10 +01:00
OlivierDehaene
1a2d68250a
feat: support typical sampling ( #114 )
...
closes #112
2023-03-09 11:33:57 +01:00
OlivierDehaene
941cd42e0c
fix(server): fix index out of range for watermarking ( #110 )
2023-03-08 18:29:08 +01:00
OlivierDehaene
b1485e18c5
fix(server): fix galactica batch ( #106 )
...
closes #105
2023-03-07 20:05:21 +01:00
OlivierDehaene
3fef90d50f
feat(clients): Python client ( #103 )
2023-03-07 18:52:22 +01:00
OlivierDehaene
cd5961b5da
feat: allow local models ( #101 )
...
closes #99
2023-03-06 14:39:36 +01:00
OlivierDehaene
9b205d33cc
fix(server): fix generate_stream by forcing tokens to be decoded correctly ( #100 )
2023-03-06 13:22:58 +01:00
OlivierDehaene
1c19b0934e
v0.3.2 ( #97 )
2023-03-03 18:42:20 +01:00
OlivierDehaene
0b6807caa4
feat(server): fix transformers commit ( #96 )
2023-03-03 17:56:27 +01:00
OlivierDehaene
2d39f199ae
feat(server): update to hf_transfer==0.1.2 ( #93 )
2023-03-03 11:26:27 +01:00
OlivierDehaene
9b8ea6a6c7
feat(server): add logits watermark ( #90 )
2023-03-02 12:30:41 +01:00
OlivierDehaene
65e2f1624e
fix(server): fix token_is_special ( #87 )
2023-02-24 17:20:00 +01:00
OlivierDehaene
0ac184ce77
feat(server): add special token bool ( #85 )
2023-02-24 15:55:57 +01:00
OlivierDehaene
4b1c9720c0
v0.3.1 ( #84 )
2023-02-24 13:27:41 +01:00
OlivierDehaene
44ce098c10
feat(server): pre-allocate max attention mask ( #75 )
2023-02-24 12:49:21 +01:00
OlivierDehaene
78063c0569
fix(server): remove position_ids from galactica forward ( #82 )
...
closes #80
2023-02-20 19:28:57 +01:00
OlivierDehaene
17bc841b1b
feat(server): enable hf-transfer ( #76 )
2023-02-18 14:04:11 +01:00
OlivierDehaene
c720555adc
v0.3.0 ( #72 )
2023-02-16 17:28:29 +01:00
OlivierDehaene
439fcaf810
feat(router): add prometheus metrics scrape endpoint ( #71 )
2023-02-16 17:18:53 +01:00
OlivierDehaene
c5a4a1faf3
feat(server): improve download logging ( #66 )
2023-02-15 16:11:32 +01:00
OlivierDehaene
0fbc691946
feat: add safetensors conversion ( #63 )
2023-02-14 13:02:16 +01:00
OlivierDehaene
9af454142a
feat: add distributed tracing ( #62 )
2023-02-13 13:02:45 +01:00
OlivierDehaene
1ad3250b89
fix(docker): increase shm size ( #60 )
2023-02-08 17:53:33 +01:00
OlivierDehaene
c503a639b1
feat(server): support t5 ( #59 )
2023-02-07 18:25:17 +01:00
OlivierDehaene
2fe5e1b30e
V0.2.1 ( #58 )
2023-02-07 15:40:25 +01:00
OlivierDehaene
4acc42a605
fix(server): better handling of inference mode ( #57 )
2023-02-07 15:38:22 +01:00
OlivierDehaene
20c3c5940c
feat(router): refactor API and add openAPI schemas ( #53 )
2023-02-03 12:43:37 +01:00
OlivierDehaene
b1482d9048
breaking(router): modify /generate API to only return generated text ( #50 )
...
@njhill, @yk FYI
generated_text was concatenated to the user prompt for legacy reason. We
want to remove this behaviour as we don't think it is useful and even
detrimonial to usability.
We also remove the unused Vec.
2023-02-02 15:02:04 +01:00
OlivierDehaene
df227ac20d
fix(server): allow greedy repetition penalty ( #51 )
2023-02-02 10:34:35 +01:00
OlivierDehaene
775115e3a5
feat(server): allow the server to use a local weight cache ( #49 )
2023-02-01 16:22:10 +01:00
OlivierDehaene
313194f6d7
feat(server): support repetition penalty ( #47 )
2023-02-01 15:58:42 +01:00
OlivierDehaene
2ad895a6cc
feat(server): allow gpt-neox models with odd vocab sizes to be sharded ( #48 )
2023-02-01 14:43:59 +01:00
OlivierDehaene
f830706b21
feat(server): Support GPT-Neox ( #39 )
2023-01-31 18:53:56 +01:00
OlivierDehaene
c6e8b9442b
fix(server): fix quantization for sharded models ( #45 )
2023-01-31 17:40:38 +01:00
OlivierDehaene
017a2a8c2f
feat: Add token streaming using ServerSideEvents support ( #41 )
2023-01-31 17:04:00 +01:00
OlivierDehaene
54fec93193
fix(server): fix seeding with multiple shards ( #44 )
2023-01-31 16:01:15 +01:00
OlivierDehaene
03bdf18290
fix(server): fix seeding on gpu ( #42 )
2023-01-31 14:30:33 +01:00
OlivierDehaene
4f9ac67cfa
Revert "feat: Add token streaming using ServerSideEvents support" ( #40 )
...
Reverts huggingface/text-generation-inference#36
2023-01-31 14:21:51 +01:00
OlivierDehaene
7fbfbb0dc5
feat: Add token streaming using ServerSideEvents support ( #36 )
...
Add token streaming using ServerSideEvents (SSE).
The signature of the SSE events is:
```rust
struct Details {
finish_reason: String,
generated_tokens: u32,
seed: Option<u64>,
}
struct StreamResponse {
token: Token,
generated_text: Option<String>,
details: Option<Details>,
}
struct ErrorResponse {
error: String,
}
```
2023-01-31 11:49:43 +01:00
OlivierDehaene
cd298bc5e5
feat: Support sampling seeding ( #37 )
...
Co-authored-by: Yannic Kilcher <yk@users.noreply.github.com>
2023-01-30 15:36:16 +01:00
OlivierDehaene
ce960be0a5
feat(bloom): use torch.nn.Linear and torch.nn.GELU ( #33 )
2023-01-26 15:33:45 +01:00
OlivierDehaene
13e7044ab7
fix(dockerfile): fix docker build ( #32 )
2023-01-24 19:52:39 +01:00
OlivierDehaene
1f570d181f
fix(server): Fix position ids ( #28 )
2023-01-20 15:35:22 +01:00
OlivierDehaene
15511edc01
feat(server): Support SantaCoder ( #26 )
2023-01-20 12:24:39 +01:00
Nick Hill
e6d3eb5d5d
fix(server): Minor refactorization using new_zeros ( #24 )
...
- Fix some type hints, in particular base tokenizer class
- Make use of `tensor.new_zero/empty` methods
- Simplify env var string parsing in launcher
2023-01-17 09:10:22 +01:00
OlivierDehaene
fcc2c5fcbf
feat(launcher): Log server stdout ( #19 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-01-05 12:01:23 +01:00
Nicolas Patry
b94f30215f
fix(server): Use cleanup_tokenization_spaces=False for lossless decoding ( #13 )
...
Fixes #12 in the easiest way I could think of.
2023-01-03 11:07:05 +01:00
Nick Hill
686cc66717
fix(server): Check for device type correctly when determining initial padding ( #16 )
...
AFAIK there is no torch device type called "gpu".
2022-12-30 19:30:42 +01:00
OlivierDehaene
611e21cb13
fix(server): Fix stop sequences ( #11 )
2022-12-16 16:03:39 +01:00
OlivierDehaene
32a253063d
feat: Return logprobs ( #8 )
2022-12-15 17:03:56 +01:00
OlivierDehaene
718096f695
feat: Support stop sequences ( #7 )
2022-12-12 18:25:22 +01:00
OlivierDehaene
042180d88f
fix(server): Only pad to multiple of 8 on GPUs
2022-12-08 19:37:37 +01:00
OlivierDehaene
a2985036aa
feat(server): Add model tests ( #6 )
2022-12-08 18:49:33 +01:00
Nick Hill
31d76e238d
fix(batching): Avoid theoretical hang in batcher loop ( #5 )
...
- Avoid theoretical hang in batcher loop
- Avoid a couple of clones in the router generate method
- Keep attention mask tensors as integers
- Remove num_heads attribute
Co-authored-by: OlivierDehaene <Olivier.dehaene@gmail.com>
2022-12-05 10:10:59 +01:00
OlivierDehaene
daa1d81d5e
feat(server): Support Galactica ( #4 )
2022-12-01 19:31:54 +01:00
OlivierDehaene
dccd5c2b1a
feat(server): Clarify CausalLMBatch concatenate method
2022-11-09 18:24:07 +01:00
OlivierDehaene
fa43fb71be
fix(server): Fix Transformers fork version
2022-11-08 17:42:38 +01:00
OlivierDehaene
4236e41b0d
feat(server): Improved doc
2022-11-07 12:53:56 +01:00
OlivierDehaene
427d7cc444
feat(server): Support AutoModelForSeq2SeqLM
2022-11-04 18:03:04 +01:00
OlivierDehaene
c5665f5c8b
feat(server): Support generic AutoModelForCausalLM
2022-11-04 14:22:47 +01:00
OlivierDehaene
755fc0e403
fix(models): Revert buggy support for AutoModel
2022-11-03 16:07:54 +01:00
OlivierDehaene
b3b7ea0d74
feat: Use json formatter by default in docker image
2022-11-02 17:29:56 +01:00
OlivierDehaene
3cf6368c77
feat(server): Support all AutoModelForCausalLM on a best effort basis
2022-10-28 19:24:00 +02:00
OlivierDehaene
09674e6df9
feat(server): Support bitsandbytes
2022-10-27 14:25:29 +02:00
Nicolas Patry
c8ce9b2515
feat(server): Use safetensors
...
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
2022-10-22 20:00:15 +02:00
Olivier Dehaene
f16f2f5ae1
v0.1.0
2022-10-20 19:14:44 +02:00
Olivier Dehaene
5e5d8766a2
feat: Improve error handling
2022-10-17 14:59:00 +02:00
Olivier Dehaene
bcb53903b8
feat: Add AML deployment
2022-10-15 20:21:50 +02:00
Olivier Dehaene
bf99afe916
feat: Docker image
2022-10-14 15:56:21 +02:00
Olivier Dehaene
4c693e6524
Refactored gRPC interface
...
Added validation logic
2022-10-11 16:50:54 +02:00
Olivier Dehaene
1d986983d5
fix: cleanup
2022-10-08 12:34:25 +02:00
Olivier Dehaene
295831a481
Init
2022-10-08 12:30:12 +02:00