Commit Graph

298 Commits

Author SHA1 Message Date
Nicolas Patry 17837b1e51 Adding docs about GPTQ usage. 2023-06-15 19:41:04 +02:00
Nicolas Patry 16d0fb04ae Santacoder GPTQ support (quantized model seems awful, not sure if it's
prompting or the quantization itself).
2023-06-15 16:59:31 +02:00
Nicolas Patry 983c813f1d Typo. 2023-06-14 16:57:56 +02:00
Nicolas Patry 054a3d095c Triton is actually a dependency of torch on Linux. 2023-06-14 15:03:17 +02:00
Nicolas Patry 732da6942b Remove lots of dead code, move triton to hard requirement
- Added option to upload to hub directly after quantizing.
2023-06-14 14:55:45 +02:00
Nicolas Patry 5de6863756 No one saw that, therefore it didn't happen. 2023-06-14 12:23:30 +02:00
Nicolas Patry 55cf4d257c Tiny fixes for falcon. 2023-06-14 09:42:55 +02:00
Nicolas Patry e5e552b496 Falcon 2023-06-14 09:42:55 +02:00
Ubuntu ee1f94e64b Fixing register bias + gptq_bits type. 2023-06-14 09:42:55 +02:00
Ubuntu ffe8fc4699 Fixing few things 2023-06-14 09:42:55 +02:00
Ubuntu dadbbc27d5 Neox. 2023-06-14 09:42:55 +02:00
Ubuntu 3fb8979a6d Re-enabling dim=dim in TensorParallelColumn because llama. 2023-06-14 09:42:55 +02:00
Ubuntu ae308f88ec Some fixes. 2023-06-14 09:42:55 +02:00
Ubuntu a0a194c391 Functioning quantization script. 2023-06-14 09:42:55 +02:00
Ubuntu 5a72715344 Adding quantization scripts. 2023-06-14 09:42:55 +02:00
Nicolas Patry da8ebf16fe Typo. 2023-06-14 09:42:55 +02:00
Ubuntu 0b5859213e Fixing the dockerfile (require triton + gcc for compiling). 2023-06-14 09:42:55 +02:00
Ubuntu 92f85c964d Removing dead code. 2023-06-14 09:42:55 +02:00
Ubuntu 9a12941bef [WIP] Inference support for GPTQ (llama at least)
Let's start discussing implementation.

- Need to expose the quantization scripts (either included here or add
  doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently this means checking for quantization at every place we use
`get_{tensor|sharded}`.

My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the
`qweight,qzeros,scales,g_idx` should be in a single place, and
independent of bias presence.
2023-06-14 09:42:55 +02:00
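The single-place layout proposed above can be sketched as follows. This is a hypothetical illustration, not code from this repository: the `GPTQWeight` dataclass and `load_gptq` helper names are assumed, and `get_tensor` stands in for whatever `get_{tensor|sharded}` accessor the loader exposes.

```python
# Hypothetical sketch: gather the four GPTQ tensors for one layer in a
# single place, independent of whether a bias is present.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GPTQWeight:
    qweight: Any  # packed n-bit integer weights
    qzeros: Any   # packed zero-points, one per quantization group
    scales: Any   # per-group dequantization scales
    g_idx: Any    # maps each input channel to its quantization group


def load_gptq(get_tensor: Callable[[str], Any], prefix: str) -> GPTQWeight:
    # One accessor fetches all four tensors, so callers no longer have to
    # special-case quantization at every get_{tensor|sharded} call site.
    return GPTQWeight(
        qweight=get_tensor(f"{prefix}.qweight"),
        qzeros=get_tensor(f"{prefix}.qzeros"),
        scales=get_tensor(f"{prefix}.scales"),
        g_idx=get_tensor(f"{prefix}.g_idx"),
    )
```

With this shape, modeling code asks for one `GPTQWeight` per layer instead of probing for quantization at each tensor lookup.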
OlivierDehaene 5ce89059f8
feat(server): pre-allocate past key values for flash causal LM (#412) 2023-06-12 18:30:29 +02:00
sayf eddine hammemi ca650e5bff
fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)
# What does this PR do?

This PR fixes:
- The use of a non-POSIX comparison, which may fail depending on the
shell used (`=` always works, `==` only in bash)
- A typo in the env variable name displayed in the error message:
`BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`
Fixes #422
2023-06-12 15:24:53 +02:00
A.J d4eb60f48d
docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)
# What does this PR do?
It fixes a typo in the comment sections referencing the environment
variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this
variable were found in code logic, so no undefined behaviour or bugs
result from the typo. This PR is not expected to modify any code logic.
2023-06-12 13:59:22 +02:00
OlivierDehaene e496c9ba5b
feat(server): optimize dist ops (#434) 2023-06-09 11:55:29 +02:00
Nicolas Patry abd58ff82c
feat(server): Rework model loading (#344)
# What does this PR do?

Reworked the loading logic. The idea is to use cleaner loading code:

- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of loading the weights from
multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of
sharding we need + an eventual `all_reduce`. They do not inherit from
Linear, but they contain some kind of Linear instead
- the contained linear can be either `FastLinear`, `BnbLinear` or, next,
a GPTQ linear
- All modeling code is explicitly made for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)

![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
2023-06-08 14:51:52 +02:00
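The "shell" layout described in this PR can be sketched as below. This is an assumed illustration, not the PR's actual code: `TensorParallelRowShell` and `NoOpProcessGroup` are hypothetical names, and the wrapped `linear` stands in for a `FastLinear`, `BnbLinear`, or GPTQ linear.

```python
# Hypothetical sketch of the "shell" pattern: the TP layer does not inherit
# from Linear; it wraps one and only adds sharding knowledge + all_reduce.
from typing import Any, Callable, Optional


class NoOpProcessGroup:
    # For non-sharded runs the process group does nothing, so a single
    # code path serves both sharded and single-GPU execution.
    def size(self) -> int:
        return 1

    def all_reduce(self, tensor: Any) -> Any:
        return tensor


class TensorParallelRowShell:
    def __init__(
        self,
        linear: Callable[[Any], Any],
        process_group: Optional[NoOpProcessGroup] = None,
    ):
        self.linear = linear  # FastLinear, BnbLinear, a GPTQ linear, ...
        self.process_group = process_group or NoOpProcessGroup()

    def __call__(self, x: Any) -> Any:
        out = self.linear(x)
        # Row-parallel layers sum partial results across shards; with the
        # no-op group this reduction is simply the identity.
        return self.process_group.all_reduce(out)
```

The design point is composition over inheritance: swapping the quantization backend only changes the contained linear, never the sharding shell.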
OlivierDehaene 19c41824cb chore: update openapi schema 2023-06-05 18:16:08 +02:00
OlivierDehaene 6abec14a7e
feat(server): batch tokenization for flash causal lm (#411) 2023-06-05 16:09:41 +02:00
OlivierDehaene 895c5f1562
feat(server): only compute prefill logprobs when asked (#406)
Close #288
2023-06-02 17:12:30 +02:00
OlivierDehaene 83b84486ad
feat(launcher): parse oom signal (#404) 2023-06-02 14:17:27 +02:00
OlivierDehaene 62fc401030
feat(sagemaker): add trust remote code to entrypoint (#394) 2023-06-02 09:51:06 +02:00
OlivierDehaene e7248fe90e v0.8.2 2023-06-01 19:49:13 +02:00
OlivierDehaene 95d3546976
feat(server): load santacoder/starcoder models with safetensors (#393)
Fix #366
2023-06-01 12:10:35 +02:00
OlivierDehaene c0928e6f26
feat(server): remove trust_remote_code requirement for falcon models (#396) 2023-06-01 12:07:41 +02:00
OlivierDehaene d69a0633be
fix(server): fix has_position_ids (#395)
Fix #389
2023-06-01 11:41:35 +02:00
OlivierDehaene db2ebe3947 v0.8.1 2023-05-31 12:08:40 +02:00
OlivierDehaene 337afb2842
fix(server): fix bnb quantization for CausalLM models (#385) 2023-05-31 11:48:28 +02:00
OlivierDehaene 87dc034b59
feat(server): add retry on download (#384) 2023-05-31 10:57:53 +02:00
OlivierDehaene 444400b457 increase health checks 2023-05-31 10:55:59 +02:00
OlivierDehaene 081b926584 v0.8.0 2023-05-30 18:39:35 +02:00
OlivierDehaene b8b950b37c
feat(server): support RefinedWeb models (#379) 2023-05-30 18:25:19 +02:00
OlivierDehaene bf7f1d5434 fix(server): fix quantization 2023-05-30 13:56:03 +02:00
OlivierDehaene 49a6c8c1b2 fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES 2023-05-30 13:27:48 +02:00
OlivierDehaene 146e72c3be fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES 2023-05-30 12:52:18 +02:00
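The device-count parsing named in the two launcher fixes above can be sketched roughly as follows. This is an illustrative Python sketch under assumed semantics, not the launcher's actual implementation; `num_cuda_devices` is a hypothetical helper name.

```python
# Illustrative sketch: derive the number of usable GPUs from the
# visibility environment variables, preferring CUDA_VISIBLE_DEVICES
# over NVIDIA_VISIBLE_DEVICES.
import os
from typing import Optional


def num_cuda_devices() -> Optional[int]:
    for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
        devices = os.environ.get(var)
        if devices is None:
            continue
        if devices == "all":  # container runtimes may export "all"
            return None  # count unknown here; query the driver instead
        # The value is a comma-separated list of device ids/UUIDs.
        return len([d for d in devices.split(",") if d.strip()])
    return None  # neither variable set: fall back to driver enumeration
```

A `None` return signals that the caller should enumerate devices some other way rather than trusting the environment.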
CL-Shang 5fde8d9991
Fix issue when load AutoModelForSeq2SeqLM model (#370) 2023-05-26 12:31:47 +02:00
OlivierDehaene 62f91f78ac
feat(server): support vectorized warpers in flash causal lm (#317)
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
2023-05-26 12:30:27 +02:00
OlivierDehaene 951930fbff
feat(benchmarker): add summary tables (#368) 2023-05-25 13:38:36 +02:00
OlivierDehaene 218c9adaa5
feat: decrease IPC proto size (#367)
Closes #307 #308
2023-05-24 19:19:57 +02:00
OlivierDehaene d31562f300
v0.7.0 (#353) 2023-05-23 21:20:49 +02:00
OlivierDehaene 942005386a
feat(router): log input/output at debug level (#364)
@njhill FYI
2023-05-23 20:47:37 +02:00
OlivierDehaene e3e487dc71
feat(server): support trust_remote_code (#363) 2023-05-23 20:40:39 +02:00
OlivierDehaene e9669a4085
feat(server): do not use device_map auto on single GPU (#362) 2023-05-23 19:12:12 +02:00