Commit Graph

132 Commits

Author SHA1 Message Date
OlivierDehaene a6c18c39bb
feat(server): use cuda graph in logits warping (#302) 2023-05-10 19:08:54 +02:00
OlivierDehaene 745f596c88
feat(server): use float16 (#304) 2023-05-10 15:51:10 +02:00
OlivierDehaene 68e9d6ab33
feat(server): shard token decode (#303) 2023-05-10 15:48:21 +02:00
OlivierDehaene ad66f6ef9a
feat(server): optim flash causal lm decode_token (#285) 2023-05-09 18:26:19 +02:00
Nicolas Patry b4aa87db58
fea(server): decrease convert RAM requirements (#286) 2023-05-05 17:57:02 +02:00
Nicolas Patry 690fc31757
fix(server): fix convert (#284) 2023-05-05 15:28:08 +02:00
Nicolas Patry f08343d44d
fix(server): Removes the parallelism in file convertion (during download) (#275) 2023-05-04 15:22:54 +02:00
OlivierDehaene 85aa7e2e7b
feat(server): support hf endpoint weight layout (#266) 2023-05-03 11:36:24 +02:00
OlivierDehaene 4096000e34
fix(server): fix typo in tokenizers decode (#269)
closes #268
2023-05-03 10:10:34 +02:00
Ehsan M. Kermani f092ba9b22
feat(server): add watermarking tests (#248) 2023-04-27 19:16:35 +02:00
OlivierDehaene b9ae7e5da1
chore(server): update transformers (#250) 2023-04-27 09:57:41 +02:00
Nick Hill 34bca0b8d3
fix(server): Small tidy of code from recent changes (#251)
remaining_decode_tokens was calculated twice in Seq2SeqLMBatch.filter()
2023-04-27 09:57:28 +02:00
Nick Hill b4cf832c40
fix(server): fix reshaping of bloom past_key_values in concatenate() (#252)
Introduced in #214 

Fixes #249
2023-04-27 09:51:27 +02:00
Nicolas Patry db2b4e0754
feat(router): new healthcheck that skips the queue (#244)
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-04-26 20:23:54 +02:00
OlivierDehaene 37b64a5c10
chore(server): update safetensors version (#235) 2023-04-25 13:50:56 +02:00
OlivierDehaene ebc74d5666
feat(router): use number of tokens in batch as input for dynamic batching (#226)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-04-24 17:59:00 +02:00
OlivierDehaene 98a3e0d135
chore(server): update huggingface-hub (#227) 2023-04-24 15:57:13 +02:00
Nick Hill 4a7dd4085a
feat(server): reduce memory requirement (#214) 2023-04-24 14:15:42 +02:00
OlivierDehaene 6ded76a4ae
v0.6.0 (#222) 2023-04-21 21:00:57 +02:00
OlivierDehaene 4b460e72fb
fix(server): fix flash batch filtering (#220) 2023-04-21 20:26:01 +02:00
OlivierDehaene 1ffea36ec2
fix(server): fix flash causal (#219) 2023-04-21 19:49:08 +02:00
OlivierDehaene 86bca365df
fix(server): fix flash causal (#218) 2023-04-21 19:42:16 +02:00
OlivierDehaene afc5b999d0
fix(server): cleanup new flash past_key_values logic (#217) 2023-04-21 16:19:04 +02:00
OlivierDehaene db4cb5e4ed
fix(server): fix past key values logic (#216)
@njhill fyi
2023-04-21 15:59:18 +02:00
OlivierDehaene 343437c7b5
feat(router): add device and dtype info (#215) 2023-04-21 15:36:29 +02:00
Nick Hill ac8c0f6fe4
feat(server): flash attention past key value optimizations (#213) 2023-04-21 14:57:18 +02:00
OlivierDehaene 709d8936f6
feat(router): drop requests when client closes the channel (#202) 2023-04-20 11:07:40 +02:00
OlivierDehaene b6ee0ec7b0
feat(router): add git sha to info route (#208) 2023-04-19 21:36:59 +02:00
OlivierDehaene 6837b2eb77
fix(docker): remove unused dependencies (#205) 2023-04-19 19:39:31 +02:00
OlivierDehaene 5d27f5259b
fix(server): fix hf_transfer issue with private repos (#203) 2023-04-19 17:36:16 +02:00
OlivierDehaene a88c54bb4c
feat(server): check cuda capability when importing flash models (#201)
close #198
2023-04-19 12:52:37 +02:00
OlivierDehaene e14ae3b5e9
feat(server): support quantization for flash models (#200)
closes #197
2023-04-19 12:51:11 +02:00
OlivierDehaene 7a1ba58557
fix(docker): fix docker image dependencies (#187) 2023-04-17 00:26:47 +02:00
OlivierDehaene 53ee09c0b0
fea(dockerfile): better layer caching (#159) 2023-04-14 10:12:21 +02:00
OlivierDehaene 64347b05ff
fix(ci): fix CVE in github-slug-action (#174) 2023-04-13 12:43:05 +02:00
OlivierDehaene 880a76eed5
feat(server): support sharded santacoder (#167) 2023-04-12 17:18:08 +02:00
OlivierDehaene 5fa8ae041c
feat(server): optimize decode for sane tokenizers (#170) 2023-04-12 12:03:10 +02:00
OlivierDehaene 6f0f1d70f6
v0.5.0 (#168) 2023-04-11 20:32:18 +02:00
OlivierDehaene f26dfd0dc1
feat(server): support OPT models (#55)
OPT models do not all have a `tokenizer.json` file on the hub at the
moment. Can't merge for now.
2023-04-11 19:16:41 +02:00
OlivierDehaene 299217c95c
feat(server): add flash attention llama (#144) 2023-04-11 16:38:22 +02:00
OlivierDehaene 9987960062
feat(router): make router input validation optional (#164) 2023-04-09 20:22:27 +02:00
OlivierDehaene 1883d8ecde
feat(docker): improve flash_attention caching (#160) 2023-04-09 19:59:16 +02:00
OlivierDehaene 3f2542bb6a
fix(server): fix escape characters in stop sequence (#155) 2023-04-05 19:37:41 +02:00
OlivierDehaene c0aeb32583
feat(server): flash santacoder (#153) 2023-04-03 19:06:42 +02:00
OlivierDehaene fef1a1c381
v0.4.3 (#152) 2023-03-30 17:28:14 +02:00
OlivierDehaene 84722f3e33
v0.4.2 (#151) 2023-03-30 17:10:01 +02:00
OlivierDehaene 08b7e4a282
fix(server): fix flash neox rotary embeddings (#150) 2023-03-30 16:12:23 +02:00
OlivierDehaene 610bb1f978
feat(benchmark): tui based benchmarking tool (#149) 2023-03-30 15:26:27 +02:00
OlivierDehaene c9bdaa8b73
feat(server): reduce mlp and attn in one op for flash neox (#145) 2023-03-28 16:51:41 +02:00
OlivierDehaene f000068944
feat(server): clear cache on error (#143) 2023-03-28 11:29:35 +02:00