OlivierDehaene
8bd0adb135
fix(server): fix quantization python requirements ( #708 )
2023-07-27 12:28:10 +02:00
OlivierDehaene
cf83f9b66f
v0.9.3 ( #634 )
2023-07-18 18:11:20 +02:00
OlivierDehaene
c58a0c185b
v0.9.2 ( #616 )
2023-07-14 16:31:48 +02:00
OlivierDehaene
31b36cca21
v0.9.1 ( #558 )
2023-07-06 16:05:42 +02:00
OlivierDehaene
31e2253ae7
feat(server): use latest flash attention commit ( #543 )
@njhill FYI
2023-07-04 20:23:55 +02:00
Nicolas Patry
1da07e85aa
feat(server): Add Non flash MPT. ( #514 )
# What does this PR do?
This adds a non-flash version of MPT. Flash is harder because we would
need a flash attention CUDA kernel that supports an attention bias; a
short sketch of the difference follows below.
Fixes https://github.com/huggingface/text-generation-inference/issues/361
Fixes https://github.com/huggingface/text-generation-inference/issues/491
Fixes https://github.com/huggingface/text-generation-inference/issues/290
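A minimal sketch of the point, assuming PyTorch; `attention_with_bias` is an illustrative name, not the actual implementation. With plain matmul attention, an additive bias such as MPT's ALiBi is one extra term, whereas a fused flash kernel would have to support the bias internally:

```python
# Illustrative only: plain (non-flash) attention with an additive bias.
import math
import torch

def attention_with_bias(q, k, v, bias):
    # q, k, v: [batch, heads, seq_len, head_dim]
    # bias: broadcastable to [batch, heads, seq_len, seq_len] (e.g. ALiBi)
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.size(-1))
    scores = scores + bias  # trivial here; needs kernel support in flash attention
    return torch.softmax(scores, dim=-1) @ v
```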
2023-07-03 13:01:46 +02:00
OlivierDehaene
e28a809004
v0.9.0 ( #525 )
2023-07-01 19:25:41 +02:00
Nicolas Patry
abd58ff82c
feat(server): Rework model loading ( #344 )
# What does this PR do?
Reworked the loading logic. The idea is to use cleaner loading code:
- Remove the need for `no_init_weights`
- Remove all the weird `bnb_linear`, `load_weights` and
`post_load_weights` code.
New code layout:
- New class `Weights` in charge of loading the weights from
multiple files into the appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code that knows what kind of
sharding is needed plus the eventual `all_reduce`. They do not inherit from
Linear, but contain some kind of Linear instead (the contained linear can
be FastLinear or BnbLinear, with GPTQ Linear next); see the sketch after
this list.
- All modeling code is explicitly written for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)
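A minimal sketch of that layout, assuming PyTorch and safetensors; the class and method names here are illustrative, not the actual text-generation-inference code:

```python
# Hypothetical sketch of the described layout, not the real implementation.
import torch
import torch.distributed
from safetensors import safe_open


class Weights:
    """Maps tensor names to the safetensors file that holds them and
    hands out full or sharded tensors on request."""

    def __init__(self, filenames, device, process_group):
        self.device = device
        self.process_group = process_group
        self.routing = {}
        for filename in filenames:
            with safe_open(filename, framework="pt") as f:
                for name in f.keys():
                    self.routing[name] = filename

    def get_tensor(self, name):
        with safe_open(self.routing[name], framework="pt") as f:
            return f.get_tensor(name).to(self.device)

    def get_sharded(self, name, dim):
        # Load only this rank's slice of the tensor (dim 0 or 1 here).
        world_size = self.process_group.size()
        rank = self.process_group.rank()
        with safe_open(self.routing[name], framework="pt") as f:
            slice_ = f.get_slice(name)
            block = slice_.get_shape()[dim] // world_size
            start, stop = rank * block, (rank + 1) * block
            tensor = slice_[start:stop] if dim == 0 else slice_[:, start:stop]
        return tensor.to(self.device)


class TensorParallelRowLinear(torch.nn.Module):
    """A 'shell' layer: wraps a plain Linear and adds the all_reduce;
    it does not inherit from nn.Linear."""

    def __init__(self, linear, process_group):
        super().__init__()
        self.linear = linear  # FastLinear, BnbLinear, GPTQ Linear, ...
        self.process_group = process_group

    def forward(self, x):
        out = self.linear(x)
        if self.process_group.size() > 1:
            torch.distributed.all_reduce(out, group=self.process_group)
        return out
```

The shell design keeps quantization orthogonal to sharding: swapping the contained linear changes how the matmul is computed, not the TP logic.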
![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
2023-06-08 14:51:52 +02:00
OlivierDehaene
e7248fe90e
v0.8.2
2023-06-01 19:49:13 +02:00
OlivierDehaene
db2ebe3947
v0.8.1
2023-05-31 12:08:40 +02:00
OlivierDehaene
081b926584
v0.8.0
2023-05-30 18:39:35 +02:00
OlivierDehaene
d31562f300
v0.7.0 ( #353 )
2023-05-23 21:20:49 +02:00
OlivierDehaene
94377efa78
chore(server): update requirements ( #357 )
Fixes #338
2023-05-23 18:03:22 +02:00
OlivierDehaene
91d9beec90
fix(server): fix init for flash causal lm ( #352 )
Fixes #347
2023-05-22 15:05:32 +02:00
OlivierDehaene
37b64a5c10
chore(server): update safetensors version ( #235 )
2023-04-25 13:50:56 +02:00
OlivierDehaene
98a3e0d135
chore(server): update huggingface-hub ( #227 )
2023-04-24 15:57:13 +02:00
OlivierDehaene
6ded76a4ae
v0.6.0 ( #222 )
2023-04-21 21:00:57 +02:00
OlivierDehaene
6837b2eb77
fix(docker): remove unused dependencies ( #205 )
2023-04-19 19:39:31 +02:00
OlivierDehaene
5d27f5259b
fix(server): fix hf_transfer issue with private repos ( #203 )
2023-04-19 17:36:16 +02:00
OlivierDehaene
7a1ba58557
fix(docker): fix docker image dependencies ( #187 )
2023-04-17 00:26:47 +02:00
OlivierDehaene
64347b05ff
fix(ci): fix CVE in github-slug-action ( #174 )
2023-04-13 12:43:05 +02:00
OlivierDehaene
6f0f1d70f6
v0.5.0 ( #168 )
2023-04-11 20:32:18 +02:00
OlivierDehaene
299217c95c
feat(server): add flash attention llama ( #144 )
2023-04-11 16:38:22 +02:00
OlivierDehaene
fef1a1c381
v0.4.3 ( #152 )
2023-03-30 17:28:14 +02:00
OlivierDehaene
84722f3e33
v0.4.2 ( #151 )
2023-03-30 17:10:01 +02:00
OlivierDehaene
ab5fd8cf93
v0.4.1 ( #140 )
2023-03-26 16:37:51 +02:00
OlivierDehaene
411d6247f4
v0.4.0 ( #119 )
2023-03-09 16:07:01 +01:00
OlivierDehaene
3fef90d50f
feat(clients): Python client ( #103 )
2023-03-07 18:52:22 +01:00
OlivierDehaene
1c19b0934e
v0.3.2 ( #97 )
2023-03-03 18:42:20 +01:00
OlivierDehaene
2d39f199ae
feat(server): update to hf_transfer==0.1.2 ( #93 )
2023-03-03 11:26:27 +01:00
OlivierDehaene
4b1c9720c0
v0.3.1 ( #84 )
2023-02-24 13:27:41 +01:00
OlivierDehaene
17bc841b1b
feat(server): enable hf-transfer ( #76 )
2023-02-18 14:04:11 +01:00
OlivierDehaene
c720555adc
v0.3.0 ( #72 )
2023-02-16 17:28:29 +01:00
OlivierDehaene
9af454142a
feat: add distributed tracing ( #62 )
2023-02-13 13:02:45 +01:00
OlivierDehaene
2fe5e1b30e
V0.2.1 ( #58 )
2023-02-07 15:40:25 +01:00
OlivierDehaene
20c3c5940c
feat(router): refactor API and add openAPI schemas ( #53 )
2023-02-03 12:43:37 +01:00
OlivierDehaene
54fec93193
fix(server): fix seeding with multiple shards ( #44 )
2023-01-31 16:01:15 +01:00
OlivierDehaene
fcc2c5fcbf
feat(launcher): Log server stdout ( #19 )
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2023-01-05 12:01:23 +01:00
OlivierDehaene
a2985036aa
feat(server): Add model tests ( #6 )
2022-12-08 18:49:33 +01:00
OlivierDehaene
4236e41b0d
feat(server): Improved doc
2022-11-07 12:53:56 +01:00
OlivierDehaene
b3b7ea0d74
feat: Use json formatter by default in docker image
2022-11-02 17:29:56 +01:00
OlivierDehaene
3cf6368c77
feat(server): Support all AutoModelForCausalLM on a best effort basis
2022-10-28 19:24:00 +02:00
OlivierDehaene
09674e6df9
feat(server): Support bitsandbytes
2022-10-27 14:25:29 +02:00
Olivier Dehaene
f16f2f5ae1
v0.1.0
2022-10-20 19:14:44 +02:00
Olivier Dehaene
5e5d8766a2
feat: Improve error handling
2022-10-17 14:59:00 +02:00
Olivier Dehaene
bf99afe916
feat: Docker image
2022-10-14 15:56:21 +02:00
Olivier Dehaene
295831a481
Init
2022-10-08 12:30:12 +02:00