OlivierDehaene
|
a6c18c39bb
|
feat(server): use cuda graph in logits warping (#302)
|
2023-05-10 19:08:54 +02:00 |
OlivierDehaene
|
745f596c88
|
feat(server): use float16 (#304)
|
2023-05-10 15:51:10 +02:00 |
OlivierDehaene
|
68e9d6ab33
|
feat(server): shard token decode (#303)
|
2023-05-10 15:48:21 +02:00 |
OlivierDehaene
|
ad66f6ef9a
|
feat(server): optim flash causal lm decode_token (#285)
|
2023-05-09 18:26:19 +02:00 |
Nicolas Patry
|
b4aa87db58
|
feat(server): decrease convert RAM requirements (#286)
|
2023-05-05 17:57:02 +02:00 |
Nicolas Patry
|
690fc31757
|
fix(server): fix convert (#284)
|
2023-05-05 15:28:08 +02:00 |
Nicolas Patry
|
f08343d44d
|
fix(server): Removes the parallelism in file conversion (during download) (#275)
|
2023-05-04 15:22:54 +02:00 |
OlivierDehaene
|
85aa7e2e7b
|
feat(server): support hf endpoint weight layout (#266)
|
2023-05-03 11:36:24 +02:00 |
OlivierDehaene
|
4096000e34
|
fix(server): fix typo in tokenizers decode (#269)
closes #268
|
2023-05-03 10:10:34 +02:00 |
Ehsan M. Kermani
|
f092ba9b22
|
feat(server): add watermarking tests (#248)
|
2023-04-27 19:16:35 +02:00 |
Nick Hill
|
34bca0b8d3
|
fix(server): Small tidy of code from recent changes (#251)
remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
|
2023-04-27 09:57:28 +02:00 |
Nick Hill
|
b4cf832c40
|
fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)
Introduced in #214
Fixes #249
|
2023-04-27 09:51:27 +02:00 |
Nicolas Patry
|
db2b4e0754
|
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
|
2023-04-26 20:23:54 +02:00 |
OlivierDehaene
|
ebc74d5666
|
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2023-04-24 17:59:00 +02:00 |
Nick Hill
|
4a7dd4085a
|
feat(server): reduce memory requirement (#214)
|
2023-04-24 14:15:42 +02:00 |
OlivierDehaene
|
4b460e72fb
|
fix(server): fix flash batch filtering (#220)
|
2023-04-21 20:26:01 +02:00 |
OlivierDehaene
|
1ffea36ec2
|
fix(server): fix flash causal (#219)
|
2023-04-21 19:49:08 +02:00 |
OlivierDehaene
|
86bca365df
|
fix(server): fix flash causal (#218)
|
2023-04-21 19:42:16 +02:00 |
OlivierDehaene
|
afc5b999d0
|
fix(server): cleanup new flash past_key_values logic (#217)
|
2023-04-21 16:19:04 +02:00 |
OlivierDehaene
|
db4cb5e4ed
|
fix(server): fix past key values logic (#216)
@njhill fyi
|
2023-04-21 15:59:18 +02:00 |
OlivierDehaene
|
343437c7b5
|
feat(router): add device and dtype info (#215)
|
2023-04-21 15:36:29 +02:00 |
Nick Hill
|
ac8c0f6fe4
|
feat(server): flash attention past key value optimizations (#213)
|
2023-04-21 14:57:18 +02:00 |
OlivierDehaene
|
709d8936f6
|
feat(router): drop requests when client closes the channel (#202)
|
2023-04-20 11:07:40 +02:00 |
OlivierDehaene
|
b6ee0ec7b0
|
feat(router): add git sha to info route (#208)
|
2023-04-19 21:36:59 +02:00 |
OlivierDehaene
|
a88c54bb4c
|
feat(server): check cuda capability when importing flash models (#201)
close #198
|
2023-04-19 12:52:37 +02:00 |
OlivierDehaene
|
e14ae3b5e9
|
feat(server): support quantization for flash models (#200)
closes #197
|
2023-04-19 12:51:11 +02:00 |
OlivierDehaene
|
7a1ba58557
|
fix(docker): fix docker image dependencies (#187)
|
2023-04-17 00:26:47 +02:00 |
OlivierDehaene
|
880a76eed5
|
feat(server): support sharded santacoder (#167)
|
2023-04-12 17:18:08 +02:00 |
OlivierDehaene
|
5fa8ae041c
|
feat(server): optimize decode for sane tokenizers (#170)
|
2023-04-12 12:03:10 +02:00 |
OlivierDehaene
|
f26dfd0dc1
|
feat(server): support OPT models (#55)
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
|
2023-04-11 19:16:41 +02:00 |
OlivierDehaene
|
299217c95c
|
feat(server): add flash attention llama (#144)
|
2023-04-11 16:38:22 +02:00 |
OlivierDehaene
|
9987960062
|
feat(router): make router input validation optional (#164)
|
2023-04-09 20:22:27 +02:00 |
OlivierDehaene
|
3f2542bb6a
|
fix(server): fix escape characters in stop sequence (#155)
|
2023-04-05 19:37:41 +02:00 |
OlivierDehaene
|
c0aeb32583
|
feat(server): flash santacoder (#153)
|
2023-04-03 19:06:42 +02:00 |
OlivierDehaene
|
08b7e4a282
|
fix(server): fix flash neox rotary embeddings (#150)
|
2023-03-30 16:12:23 +02:00 |
OlivierDehaene
|
610bb1f978
|
feat(benchmark): tui based benchmarking tool (#149)
|
2023-03-30 15:26:27 +02:00 |
OlivierDehaene
|
c9bdaa8b73
|
feat(server): reduce mlp and attn in one op for flash neox (#145)
|
2023-03-28 16:51:41 +02:00 |
OlivierDehaene
|
f000068944
|
feat(server): clear cache on error (#143)
|
2023-03-28 11:29:35 +02:00 |
Nick Hill
|
462530c2b0
|
fix(server): Avoid using try/except to determine kind of AutoModel (#142)
|
2023-03-27 09:23:22 +02:00 |
OlivierDehaene
|
678b2f3900
|
feat(server): cleanup flash neox loading (#139)
|
2023-03-26 16:37:21 +02:00 |
OlivierDehaene
|
d6a93fe992
|
fix(server): fix flash-neox scores warping (#137)
|
2023-03-24 18:21:41 +01:00 |
OlivierDehaene
|
05e9a796cc
|
feat(server): flash neoX (#133)
|
2023-03-24 14:02:14 +01:00 |
OlivierDehaene
|
b49dbf2d88
|
fix(server): use server tokenizer as gt (#128)
|
2023-03-16 12:12:26 +01:00 |
OlivierDehaene
|
8ad60b752f
|
fix(server): add position ids to neox (#126)
|
2023-03-15 13:12:49 +01:00 |
OlivierDehaene
|
c0795de2f2
|
fix(server): do not warp prefill logits (#116)
|
2023-03-09 13:00:10 +01:00 |
OlivierDehaene
|
1a2d68250a
|
feat: support typical sampling (#114)
closes #112
|
2023-03-09 11:33:57 +01:00 |
OlivierDehaene
|
941cd42e0c
|
fix(server): fix index out of range for watermarking (#110)
|
2023-03-08 18:29:08 +01:00 |
OlivierDehaene
|
b1485e18c5
|
fix(server): fix galactica batch (#106)
closes #105
|
2023-03-07 20:05:21 +01:00 |
OlivierDehaene
|
3fef90d50f
|
feat(clients): Python client (#103)
|
2023-03-07 18:52:22 +01:00 |