Commit Graph

298 Commits

Author SHA1 Message Date
Nicolas Patry 17837b1e51 Adding docs about GPTQ usage. 2023-06-15 19:41:04 +02:00
Nicolas Patry 16d0fb04ae Santacoder GPTQ support (quantized model seems awful, not sure if it's
prompting or the quantization itself).
2023-06-15 16:59:31 +02:00
Nicolas Patry 983c813f1d Typo. 2023-06-14 16:57:56 +02:00
Nicolas Patry 054a3d095c Triton is actually a dependency of torch on Linux. 2023-06-14 15:03:17 +02:00
Nicolas Patry 732da6942b Remove lots of dead code, move triton to hard requirement
- Added option to upload to hub directly after quantizing.
2023-06-14 14:55:45 +02:00
Nicolas Patry 5de6863756 No one saw that, therefore it didn't happen. 2023-06-14 12:23:30 +02:00
Nicolas Patry 55cf4d257c Tiny fixes for falcon. 2023-06-14 09:42:55 +02:00
Nicolas Patry e5e552b496 Falcon 2023-06-14 09:42:55 +02:00
Ubuntu ee1f94e64b Fixing register bias + gptq_bits type. 2023-06-14 09:42:55 +02:00
Ubuntu ffe8fc4699 Fixing few things 2023-06-14 09:42:55 +02:00
Ubuntu dadbbc27d5 Neox. 2023-06-14 09:42:55 +02:00
Ubuntu 3fb8979a6d Re-enabling dim=dim in TensorParallelColumn because llama. 2023-06-14 09:42:55 +02:00
Ubuntu ae308f88ec Some fixes. 2023-06-14 09:42:55 +02:00
Ubuntu a0a194c391 Functioning quantization script. 2023-06-14 09:42:55 +02:00
Ubuntu 5a72715344 Adding quantization scripts. 2023-06-14 09:42:55 +02:00
Nicolas Patry da8ebf16fe Typo. 2023-06-14 09:42:55 +02:00
Ubuntu 0b5859213e Fixing the dockerfile (require triton + gcc for compiling). 2023-06-14 09:42:55 +02:00
Ubuntu 92f85c964d Removing dead code. 2023-06-14 09:42:55 +02:00
Ubuntu 9a12941bef [WIP] Inference support for GPTQ (llama at least)
Let's start discussing implementation.

- Need to expose the quantization scripts (either included here or add
  doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently this means checking for quantization at every place we use
`get_{tensor|sharded}`.

My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the
`qweight,qzeros,scales,g_idx` should be in a single place, and
independent of bias presence.
2023-06-14 09:42:55 +02:00
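The single-place layout proposed above can be sketched as follows. This is a hypothetical illustration, not code from this repository: the `GPTQWeight` dataclass and `load_gptq` helper names are assumed, and `get_tensor` stands in for whatever `get_{tensor|sharded}` accessor the loader exposes.

```python
# Hypothetical sketch: gather the four GPTQ tensors for one layer in a
# single place, independent of whether a bias is present.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GPTQWeight:
    qweight: Any  # packed n-bit integer weights
    qzeros: Any   # packed zero-points, one per quantization group
    scales: Any   # per-group dequantization scales
    g_idx: Any    # maps each input channel to its quantization group


def load_gptq(get_tensor: Callable[[str], Any], prefix: str) -> GPTQWeight:
    # One accessor fetches all four tensors, so callers no longer have to
    # special-case quantization at every get_{tensor|sharded} call site.
    return GPTQWeight(
        qweight=get_tensor(f"{prefix}.qweight"),
        qzeros=get_tensor(f"{prefix}.qzeros"),
        scales=get_tensor(f"{prefix}.scales"),
        g_idx=get_tensor(f"{prefix}.g_idx"),
    )
```

With this shape, modeling code asks for one `GPTQWeight` per layer instead of probing for quantization at each tensor lookup.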
OlivierDehaene 5ce89059f8
feat(server): pre-allocate past key values for flash causal LM (#412) 2023-06-12 18:30:29 +02:00
sayf eddine hammemi ca650e5bff
fix(makefile): Fix typo and use POSIX comparison in the makefile (#443)
# What does this PR do?

This PR fixes:
- The use of a non-POSIX comparison, which may fail depending on the
shell used (`=` always works, `==` only in bash)
- A typo in the env variable name displayed in the error message:
`BUILD_EXTENSION` instead of `BUILD_EXTENSIONS`
Fixes #422
2023-06-12 15:24:53 +02:00
A.J d4eb60f48d
docs(launcher): fix CUDA_VISIBLE_DEVICES helper comment (#441)
# What does this PR do?
It fixes a typo in the comment sections referencing the environment
variable `CUDA_VISIBLE_DEVICES`. No misspelled references to this
variable were found in code logic, so no undefined behaviour or bugs
result from the typo. This PR is not expected to modify any code logic.
2023-06-12 13:59:22 +02:00
OlivierDehaene e496c9ba5b
feat(server): optimize dist ops (#434) 2023-06-09 11:55:29 +02:00
Nicolas Patry abd58ff82c
feat(server): Rework model loading (#344)
# What does this PR do?

Reworked the loading logic. The idea is to use cleaner loading code:

- Remove need for `no_init_weights`
- Remove all weird `bnb_linear` and `load_weights` and
`post_load_weights`.

New code layout:

- New class `Weights` in charge of loading the weights from
multiple files into appropriate tensors (potentially sharded)
- TP layers are now "shells": they contain the code to know what kind of
sharding we need + an eventual `all_reduce`. They do not inherit from
Linear, but they contain some kind of Linear instead
- the contained linear can be either `FastLinear`, `BnbLinear` or, next,
a GPTQ linear
- All modeling code is explicitly made for sharding; the process group is
just a no-op for non-sharded code (removes a lot of test cases)

![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f)

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
2023-06-08 14:51:52 +02:00
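The "shell" layout described in this PR can be sketched as below. This is an assumed illustration, not the PR's actual code: `TensorParallelRowShell` and `NoOpProcessGroup` are hypothetical names, and the wrapped `linear` stands in for a `FastLinear`, `BnbLinear`, or GPTQ linear.

```python
# Hypothetical sketch of the "shell" pattern: the TP layer does not inherit
# from Linear; it wraps one and only adds sharding knowledge + all_reduce.
from typing import Any, Callable, Optional


class NoOpProcessGroup:
    # For non-sharded runs the process group does nothing, so a single
    # code path serves both sharded and single-GPU execution.
    def size(self) -> int:
        return 1

    def all_reduce(self, tensor: Any) -> Any:
        return tensor


class TensorParallelRowShell:
    def __init__(
        self,
        linear: Callable[[Any], Any],
        process_group: Optional[NoOpProcessGroup] = None,
    ):
        self.linear = linear  # FastLinear, BnbLinear, a GPTQ linear, ...
        self.process_group = process_group or NoOpProcessGroup()

    def __call__(self, x: Any) -> Any:
        out = self.linear(x)
        # Row-parallel layers sum partial results across shards; with the
        # no-op group this reduction is simply the identity.
        return self.process_group.all_reduce(out)
```

The design point is composition over inheritance: swapping the quantization backend only changes the contained linear, never the sharding shell.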
OlivierDehaene 19c41824cb chore: update openapi schema 2023-06-05 18:16:08 +02:00
OlivierDehaene 6abec14a7e
feat(server): batch tokenization for flash causal lm (#411) 2023-06-05 16:09:41 +02:00
OlivierDehaene 895c5f1562
feat(server): only compute prefill logprobs when asked (#406)
Close #288
2023-06-02 17:12:30 +02:00
OlivierDehaene 83b84486ad
feat(launcher): parse oom signal (#404) 2023-06-02 14:17:27 +02:00
OlivierDehaene 62fc401030
feat(sagemaker): add trust remote code to entrypoint (#394) 2023-06-02 09:51:06 +02:00
OlivierDehaene e7248fe90e v0.8.2 2023-06-01 19:49:13 +02:00
OlivierDehaene 95d3546976
feat(server): load santacoder/starcoder models with safetensors (#393)
Fix #366
2023-06-01 12:10:35 +02:00
OlivierDehaene c0928e6f26
feat(server): remove trust_remote_code requirement for falcon models (#396) 2023-06-01 12:07:41 +02:00
OlivierDehaene d69a0633be
fix(server): fix has_position_ids (#395)
Fix #389
2023-06-01 11:41:35 +02:00
OlivierDehaene db2ebe3947 v0.8.1 2023-05-31 12:08:40 +02:00
OlivierDehaene 337afb2842
fix(server): fix bnb quantization for CausalLM models (#385) 2023-05-31 11:48:28 +02:00
OlivierDehaene 87dc034b59
feat(server): add retry on download (#384) 2023-05-31 10:57:53 +02:00
OlivierDehaene 444400b457 increase health checks 2023-05-31 10:55:59 +02:00
OlivierDehaene 081b926584 v0.8.0 2023-05-30 18:39:35 +02:00
OlivierDehaene b8b950b37c
feat(server): support RefinedWeb models (#379) 2023-05-30 18:25:19 +02:00
OlivierDehaene bf7f1d5434 fix(server): fix quantization 2023-05-30 13:56:03 +02:00
OlivierDehaene 49a6c8c1b2 fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES 2023-05-30 13:27:48 +02:00
OlivierDehaene 146e72c3be fix(launcher): parse num cuda devices from CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES 2023-05-30 12:52:18 +02:00
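The device-count parsing named in the two launcher fixes above can be sketched roughly as follows. This is an illustrative Python sketch under assumed semantics, not the launcher's actual implementation; `num_cuda_devices` is a hypothetical helper name.

```python
# Illustrative sketch: derive the number of usable GPUs from the
# visibility environment variables, preferring CUDA_VISIBLE_DEVICES
# over NVIDIA_VISIBLE_DEVICES.
import os
from typing import Optional


def num_cuda_devices() -> Optional[int]:
    for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
        devices = os.environ.get(var)
        if devices is None:
            continue
        if devices == "all":  # container runtimes may export "all"
            return None  # count unknown here; query the driver instead
        # The value is a comma-separated list of device ids/UUIDs.
        return len([d for d in devices.split(",") if d.strip()])
    return None  # neither variable set: fall back to driver enumeration
```

A `None` return signals that the caller should enumerate devices some other way rather than trusting the environment.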
CL-Shang 5fde8d9991
Fix issue when load AutoModelForSeq2SeqLM model (#370) 2023-05-26 12:31:47 +02:00
OlivierDehaene 62f91f78ac
feat(server): support vectorized warpers in flash causal lm (#317)
Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>
2023-05-26 12:30:27 +02:00
OlivierDehaene 951930fbff
feat(benchmarker): add summary tables (#368) 2023-05-25 13:38:36 +02:00
OlivierDehaene 218c9adaa5
feat: decrease IPC proto size (#367)
Closes #307 #308
2023-05-24 19:19:57 +02:00
OlivierDehaene d31562f300
v0.7.0 (#353) 2023-05-23 21:20:49 +02:00
OlivierDehaene 942005386a
feat(router): log input/output at debug level (#364)
@njhill FYI
2023-05-23 20:47:37 +02:00
OlivierDehaene e3e487dc71
feat(server): support trust_remote_code (#363) 2023-05-23 20:40:39 +02:00
OlivierDehaene e9669a4085
feat(server): do not use device_map auto on single GPU (#362) 2023-05-23 19:12:12 +02:00