preemo_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Michael Feil	972e9a7f7c	update causal batch for ct2 and fix nf4 (#17 ) * update causal batch for ct2 and fix nf4 * bump the ctranslate2 version --------- Co-authored-by: Michael Feil <michael.feil@michaelfeil.eu>	2024-02-09 11:07:14 -08:00
Michael Feil	ff703cb867	Adding ctranslate2 quantization and inference: moving the contribution (#1 ) * rebasing the commit on preemo fork. * reformatting and changes. * update dockerfile * update changes for dockerfile * adapt path --------- Co-authored-by: michaelfeil <me@michaelfeil.eu>	2023-10-02 11:12:49 -07:00
Yang, Bo	f93012d59c	Merge pull request #4 from michaelfeil/bnb_4bit 4bit quantization with bitsandbytes	2023-09-08 14:52:32 -07:00
Yang, Bo	072f267cc3	Initialize v_cache to avoid NaNs (#12 )	2023-08-23 14:23:59 -07:00
Yang, Bo	2fda8fe812	Initialize v_cache to avoid NaNs (#11 )	2023-08-23 14:07:06 -07:00
Michael Feil	a9838bba2f	Modify exllama weight	2023-08-03 23:20:59 +02:00
Yang, Bo	8af4a7a0b0	Merge branch 'main' into bnb_4bit	2023-08-02 12:47:17 -07:00
Yang, Bo	b5fadc4c28	Don't enable custom kernels if CUDA is not available (#6 )	2023-08-02 09:51:54 -07:00
Yang, Bo	8a5f80bb61	Add AutoCausalLM (#5 ) Currently `BLOOMSharded` is a subclass of `CausalLM`, while it skips `CausalLM`'s constructor. This is a surprising behavior that we might want to avoid. This PR extracts `CausalLM`'s constructor to `AutoCausalLM` to detect settings from `model_id`, so that we don't have to skip `CausalLM`'s constructor.	2023-08-02 09:35:40 -07:00
michaelfeil	656f2fe4dc	fix: typo	2023-08-02 16:56:14 +02:00
michaelfeil	44fa36b5bf	restoring commit from dev branch, rebase on current master	2023-08-01 18:15:18 +02:00
OlivierDehaene	afd04dc71e	feat(server): update vllm version (#723 )	2023-07-28 15:36:38 +02:00
OlivierDehaene	9f18f4c006	v0.9.4 (#713 )	2023-07-27 19:25:15 +02:00
OlivierDehaene	ab96b9aec3	feat(server): support new falcon config (#712 )	2023-07-27 18:38:57 +02:00
OlivierDehaene	2efd46ef95	fix(server): fix missing datasets in quantize	2023-07-27 14:50:45 +02:00
OlivierDehaene	8bd0adb135	fix(server): fix quantization python requirements (#708 )	2023-07-27 12:28:10 +02:00
Nicolas Patry	a0d55358d2	feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671 ) - Current PR is not great because we're side stepping the `Weights.__init__` but Weights shouldn't require anything related to the config or the model_id as it aims to be a simple Wrapper over multi file loading. - Ideal solution would be to use something like Rust enum ``` enum Quantize { Bitsandbytes(Bitsandbytes), GPTQ { bits: usize, groupsize: usize } } ``` And passing that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works.) Helps #601 In next steps we should make sure our quantization script uses that format and make it standard.	2023-07-25 13:00:27 +02:00
OlivierDehaene	37df6df38e	fix(server): fix exllama buffers (#689 ) Close #683	2023-07-24 14:25:43 +02:00
OlivierDehaene	73a4d65d26	feat: add cuda memory fraction (#659 ) Close #673	2023-07-24 11:43:58 +02:00
Yang, Bo	15b3e9ffb0	Directly load GPTBigCode to specified device (#618 ) This PR directly loads GPTBigCode to the specified device, avoiding moving the model between devices.	2023-07-21 11:27:31 +02:00
Nicolas Patry	d5b5bc750f	feat(server): Add exllama GPTQ CUDA kernel support #553 (#666 ) Just trying to get the integration tests to pass. --------- Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>	2023-07-21 10:59:00 +02:00
OlivierDehaene	bf94df3c71	fix(server): use mem_get_info to get kv cache size (#664 ) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636	2023-07-20 17:23:49 +02:00
Nicolas Patry	08b8eec1d7	fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661 )	2023-07-20 16:04:15 +02:00
fxmarty	362883f259	fix(server): llama v2 GPTQ (#648 ) As per title & reported https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it: ``` GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq ``` & ``` curl 127.0.0.1:8080/generate \ -X POST \ -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \ -H 'Content-Type: application/json' ```	2023-07-20 15:02:54 +02:00
cdawg	214c06f510	Add trust_remote_code to quantize script (#647 ) Fixes a bug that appeared with MR #587, which fixed issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, but is found in the function signature. This prevents models like falcon from being quantized, because trust_remote_code is required. This MR fixes the issue.	2023-07-20 13:53:08 +02:00
OlivierDehaene	fe80f5360c	feat(server): auto max_batch_total_tokens for flash att models (#630 )	2023-07-19 09:31:25 +02:00
OlivierDehaene	5e6ddfd6a4	fix(server): fix llamav2 config (#635 )	2023-07-18 18:49:42 +02:00
OlivierDehaene	cf83f9b66f	v0.9.3 (#634 )	2023-07-18 18:11:20 +02:00
Nicolas Patry	211b211ec0	feat(server): add support for llamav2 (#633 )	2023-07-18 18:09:53 +02:00
OlivierDehaene	3b71c38558	feat(server): flash attention v2 (#624 )	2023-07-18 16:21:18 +02:00
Nicolas Patry	4d38a1c4ad	feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587 ) but should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results.	2023-07-18 12:19:05 +02:00
OlivierDehaene	a2cf1bdb2f	fix(server): empty_cache when stopped	2023-07-15 13:58:19 +02:00
OlivierDehaene	c58a0c185b	v0.9.2 (#616 )	2023-07-14 16:31:48 +02:00
OlivierDehaene	5b9de4a1d3	fix(server): blacklist local files (#609 ) Close #589 #602	2023-07-13 21:54:55 +02:00
ssmi153	3628559516	GPTQ Env vars: catch correct type of error (#596 ) When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct.	2023-07-12 19:57:46 +02:00
OlivierDehaene	f2f0289fb9	feat(server): empty cache on errors	2023-07-12 17:06:19 +02:00
Nicolas Patry	67347950b7	feat(server): Implements sharding for non divisible `vocab_size`. (#583 ) - The code is relatively easy (just disable the checks on Embedding and Head) This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code.	2023-07-12 16:43:31 +02:00
ssmi153	2c4bf88268	fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590 ) This fixes a typo and extends the GPTQ_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix.	2023-07-12 14:17:35 +02:00
Adam Kowalski	7f9072228a	fix(server): Adding logger import to t5_modeling.py (#585 ) Logger is referenced during the apex importing but is not imported, causing a NameError	2023-07-12 10:40:32 +02:00
Nicolas Patry	db4efbf4bc	fix(server): T5 weights names. (#582 ) Fixes #541	2023-07-12 10:01:42 +02:00
Nicolas Patry	5bd2ab6583	feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580 ) Some models are already converted, and do not have those values in the file; this enables users to use them with less friction. Went for pure env based because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitation to do: those flags would be errors if not used in conjunction with `--quantize gptq`. Then the flags need to exist in the launcher and the server passing them all throughout all function calls. This PR is intended as an easy escape hatch, not the de facto method to use gptq in TGI. Fixes #500	2023-07-12 10:00:02 +02:00
Nicolas Patry	f0181436f4	fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579 ) Fixes #555	2023-07-12 09:51:34 +02:00
OlivierDehaene	b4024edd45	feat: better errors for warmup and TP (#575 ) Close #571	2023-07-10 14:47:15 +02:00
Nicolas Patry	e943a294bc	fix(server): harden the weights choice to save on disk. (#561 ) - Look at `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights` which are the standard attributes to select the keys to NOT save on disk (since they are ignored) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with : https://github.com/huggingface/text-generation-inference/issues/555 and : https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593	2023-07-07 14:50:12 +02:00
OlivierDehaene	31b36cca21	v0.9.1 (#558 )	2023-07-06 16:05:42 +02:00
OlivierDehaene	c4bb5264ac	fix(server): decrease memory fragmentation (#557 )	2023-07-06 14:28:33 +02:00
OlivierDehaene	31e2253ae7	feat(server): use latest flash attention commit (#543 ) @njhill FYI	2023-07-04 20:23:55 +02:00
Nick Hill	e4b26aa10b	fix(server): avoid errors for very small top_p values (#544 ) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers.	2023-07-04 20:11:33 +02:00
Antoni Baum	2a101207d4	fix(server): Handle loading from local files for MPT (#534 ) This PR allows the MPT model to be loaded from local files. Without this change, an exception will be thrown by `hf_hub_download` function if `model_id` is a local path.	2023-07-04 18:37:25 +02:00
Antoni Baum	8405581fcd	fix: Update server/Makefile to include Makefile-vllm (#520 ) For consistency and ease of use (you can just run `make` to install vllm without any extra steps).	2023-07-04 09:39:25 +02:00
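
Note on commits 5bd2ab6583 and a0d55358d2: the two messages describe reading the GPTQ settings first from the `GPTQ_BITS` / `GPTQ_GROUPSIZE` environment variables and later from `quantize_config.json`, and wish for a tagged enum to pass around during load. The Python below is only a rough sketch of that idea, not the code from either commit. The class names, the `gptq_from_config` helper, and the `bits` / `group_size` JSON keys are assumptions made for illustration.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Union


@dataclass(frozen=True)
class BitsAndBytes:
    """Marker variant: bitsandbytes quantization carries no extra settings here."""


@dataclass(frozen=True)
class Gptq:
    """GPTQ variant with the two settings named in the commit messages."""
    bits: int
    groupsize: int


# Python stand-in for the Rust-style enum sketched in a0d55358d2.
Quantize = Union[BitsAndBytes, Gptq]


def gptq_from_config(config_text: Optional[str],
                     env: Optional[Mapping[str, str]] = None) -> Gptq:
    """Prefer values from a quantize_config.json payload, else fall back to env.

    The JSON key names ("bits", "group_size") are an assumption.
    """
    env = os.environ if env is None else env
    if config_text is not None:
        data = json.loads(config_text)
        return Gptq(bits=int(data["bits"]), groupsize=int(data["group_size"]))
    # Escape hatch described in 5bd2ab6583: plain environment variables.
    return Gptq(bits=int(env["GPTQ_BITS"]), groupsize=int(env["GPTQ_GROUPSIZE"]))
```

Passing a single `Quantize` value through the loader would keep the weight-loading wrapper itself free of any knowledge of the config file or `model_id`, which is the concern raised in a0d55358d2.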

229 Commits
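
Note on commit 67347950b7: the message says sharding now works for a `vocab_size` that is not divisible by the number of shards. One common way to split such a dimension is ceiling division with a shorter last shard, sketched below. This is an assumption about the general technique, not the commit's actual code, and `shard_bounds` is a hypothetical helper name.

```python
from typing import Tuple


def shard_bounds(size: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Return the [start, stop) row range owned by `rank`.

    Uses ceiling division so every row is covered even when `size`
    is not a multiple of `world_size`; trailing ranks may get fewer rows.
    """
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    block = (size + world_size - 1) // world_size  # ceil(size / world_size)
    start = min(rank * block, size)
    stop = min(start + block, size)
    return start, stop


if __name__ == "__main__":
    # A vocabulary of 10 rows over 4 shards: the last shard is shorter.
    for rank in range(4):
        print(rank, shard_bounds(10, rank, 4))
```

As the commit message points out, the same trick is much harder to apply to `hidden_dim` or `head_dim`, where uneven shards interact with attention head layout and GPTQ packing.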