hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Ikko Eltociear Ashimine	2a13f1a046	chore: fix typo in mpt_modeling.py (#737 ) # What does this PR do? Fixed typo. implemetation -> implementation	2023-07-31 15:43:44 +02:00
Nicolas Patry	932bdd93ff	Adding Rope scaling. (#741 ) # What does this PR do? - Adds Rope NTK scaling. Done because https://github.com/huggingface/text-generation-inference/pull/529 was closed. Took some code from https://github.com/huggingface/transformers/pull/24653 - `--rope-scaling` and `--rope-factor` are added separately. I considered having a single one and parsing something like ("linear:4.0" , or "dynamic") but decided against it because it would push more parsing+validation a bit everywhere (both in the launcher and the server). Fixes #512	2023-07-31 15:38:47 +02:00
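The two scaling modes named in the commit above ("linear" and "dynamic") can be sketched in a few lines. This is a minimal illustration, not the code from the PR: the function names are hypothetical, and the dynamic formula follows the commonly used dynamic NTK form, which is assumed here rather than taken from this repository.

```python
import math
from typing import List


def inv_frequencies(dim: int, base: float = 10000.0) -> List[float]:
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def linear_scaled_angles(position: int, dim: int, factor: float,
                         base: float = 10000.0) -> List[float]:
    """Linear scaling: divide the position index by `factor` before rotating."""
    return [(position / factor) * f for f in inv_frequencies(dim, base)]


def dynamic_ntk_base(seq_len: int, max_position_embeddings: int, dim: int,
                     factor: float, base: float = 10000.0) -> float:
    """Dynamic NTK scaling (assumed form): enlarge the base only once the
    sequence is longer than the length the model was trained on."""
    if seq_len <= max_position_embeddings:
        return base
    scale = (factor * seq_len / max_position_embeddings) - (factor - 1)
    return base * scale ** (dim / (dim - 2))
```

Linear scaling changes every position uniformly, while the dynamic variant leaves short sequences untouched and only alters the frequencies once the trained context length is exceeded.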
Jae-Won Chung	b9633c46d0	Fix typing in `Model.generate_token` (#733 ) ## What does this PR do? This PR fixes a minor type annotation issue in the signature of `Model.generate_token`. All existing overrides of `Model.generate_token` return `Tuple[List[Generation], Optional[B]]`: `3ef5ffbc64/server/text_generation_server/models/causal_lm.py (L535-L537)` `3ef5ffbc64/server/text_generation_server/models/flash_causal_lm.py (L802-L804)` `3ef5ffbc64/server/text_generation_server/models/seq2seq_lm.py (L589-L591)` I suspect that back in `017a2a8c` when `GeneratedText` and `Generation` were separated, the function signature was not updated. CC @OlivierDehaene	2023-07-31 14:35:14 +02:00
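The return type described in the commit above, `Tuple[List[Generation], Optional[B]]`, can be illustrated with a small generic base class. This is a sketch only: the fields of `Generation` and `Batch` and the `EchoModel` subclass are hypothetical stand-ins, not the classes in the server package.

```python
from dataclasses import dataclass
from typing import Generic, List, Optional, Tuple, TypeVar


@dataclass
class Generation:
    # Hypothetical fields, for illustration only.
    request_id: int
    token_text: str


@dataclass
class Batch:
    # Hypothetical fields, for illustration only.
    request_ids: List[int]
    remaining_steps: int


B = TypeVar("B", bound=Batch)


class Model(Generic[B]):
    def generate_token(self, batch: B) -> Tuple[List[Generation], Optional[B]]:
        """Return one Generation per request plus the next batch, or None when done."""
        raise NotImplementedError


class EchoModel(Model[Batch]):
    def generate_token(self, batch: Batch) -> Tuple[List[Generation], Optional[Batch]]:
        generations = [Generation(rid, "tok") for rid in batch.request_ids]
        remaining = batch.remaining_steps - 1
        # The second element is None once every request in the batch has finished.
        next_batch = Batch(batch.request_ids, remaining) if remaining > 0 else None
        return generations, next_batch
```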
Nicolas Patry	92bb56b0c1	Local gptq support. (#738 ) # What does this PR do? Redoes #719	2023-07-31 10:32:52 +02:00
OlivierDehaene	3ef5ffbc64	v1.0.0 (#727 )	2023-07-28 17:43:46 +02:00
OlivierDehaene	afd04dc71e	feat(server): update vllm version (#723 )	2023-07-28 15:36:38 +02:00
OlivierDehaene	9f18f4c006	v0.9.4 (#713 )	2023-07-27 19:25:15 +02:00
OlivierDehaene	ab96b9aec3	feat(server): support new falcon config (#712 )	2023-07-27 18:38:57 +02:00
OlivierDehaene	2efd46ef95	fix(server): fix missing datasets in quantize	2023-07-27 14:50:45 +02:00
OlivierDehaene	8bd0adb135	fix(server): fix quantization python requirements (#708 )	2023-07-27 12:28:10 +02:00
Nicolas Patry	a0d55358d2	feat(server): Using `quantize_config.json` instead of GPTQ_BITS env variables. (#671 ) - Current PR is not great because we're side stepping the `Weights.__init__` but Weights shouldn't require anything related to the config or the model_id as it aims to be a simple Wrapper over multi file loading. - Ideal solution would be to use something like Rust enum ``` enum Quantize { Bitsandbytes(Bitsandbytes), GPTQ { bits: usize, groupsize: usize } } ``` And passing that around during load. Unfortunately we don't have access to this, so for now, side-stepping seems easier. - Re-enabling groupsize<0 with exllama (confirmed it works.) Helps #601 In next steps we should make sure our quantization script uses that format and make it standard.	2023-07-25 13:00:27 +02:00
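One way to read GPTQ parameters from `quantize_config.json` while keeping the `GPTQ_BITS` / `GPTQ_GROUPSIZE` environment variables as a fallback is sketched below. The helper names are hypothetical, and the JSON keys `bits` and `group_size` are an assumption about the file layout, not something stated in the commit.

```python
import json
import os
from pathlib import Path
from typing import Any, Mapping, Optional, Tuple


def load_quantize_config(model_dir: str) -> Optional[Mapping[str, Any]]:
    """Return the parsed quantize_config.json, or None if the file is absent."""
    path = Path(model_dir) / "quantize_config.json"
    if not path.is_file():
        return None
    with path.open() as f:
        return json.load(f)


def resolve_gptq_params(quantize_config: Optional[Mapping[str, Any]],
                        env: Mapping[str, str]) -> Tuple[int, int]:
    """Prefer the config file; fall back to environment variables."""
    if quantize_config is not None:
        # Assumed key names; adjust to the actual file format.
        return int(quantize_config["bits"]), int(quantize_config["group_size"])
    try:
        return int(env["GPTQ_BITS"]), int(env["GPTQ_GROUPSIZE"])
    except KeyError as err:
        raise RuntimeError(
            "No quantize_config.json and GPTQ_BITS/GPTQ_GROUPSIZE are not set"
        ) from err
```

A caller would typically pass `os.environ` as `env`, for example `resolve_gptq_params(load_quantize_config(model_dir), os.environ)`.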
OlivierDehaene	37df6df38e	fix(server): fix exllama buffers (#689 ) Close #683	2023-07-24 14:25:43 +02:00
OlivierDehaene	73a4d65d26	feat: add cuda memory fraction (#659 ) Close #673	2023-07-24 11:43:58 +02:00
Yang, Bo	15b3e9ffb0	Directly load GPTBigCode to specified device (#618 ) # What does this PR do? This PR directly loads GPTBigCode to the specified device, avoiding moving the model between devices.	2023-07-21 11:27:31 +02:00
Nicolas Patry	d5b5bc750f	feat(server): Add exllama GPTQ CUDA kernel support #553 (#666 ) Just trying to get the integration tests to pass. --------- Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>	2023-07-21 10:59:00 +02:00
OlivierDehaene	bf94df3c71	fix(server): use mem_get_info to get kv cache size (#664 ) Close https://github.com/huggingface/text-generation-inference/issues/649 Close https://github.com/huggingface/text-generation-inference/issues/651 Close https://github.com/huggingface/text-generation-inference/issues/653 Close #636	2023-07-20 17:23:49 +02:00
Nicolas Patry	08b8eec1d7	fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661 )	2023-07-20 16:04:15 +02:00
fxmarty	362883f259	fix(server): llama v2 GPTQ (#648 ) As per title & reported https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956 https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5 Test it: ``` GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq ``` & ``` curl 127.0.0.1:8080/generate \ -X POST \ -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \ -H 'Content-Type: application/json' ```	2023-07-20 15:02:54 +02:00
cdawg	214c06f510	Add trust_remote_code to quantize script (#647 ) # What does this PR do? Fixes a bug that appeared with MR #587, which fixed issue #552. See the discussion in #552. With MR #587 the trust_remote_code variable is not passed to AutoModelForCausalLM, although it is present in the function signature. This prevents models like falcon from being quantized, because trust_remote_code is required. This MR fixes the issue.	2023-07-20 13:53:08 +02:00
OlivierDehaene	fe80f5360c	feat(server): auto max_batch_total_tokens for flash att models (#630 )	2023-07-19 09:31:25 +02:00
OlivierDehaene	5e6ddfd6a4	fix(server): fix llamav2 config (#635 )	2023-07-18 18:49:42 +02:00
OlivierDehaene	cf83f9b66f	v0.9.3 (#634 )	2023-07-18 18:11:20 +02:00
Nicolas Patry	211b211ec0	feat(server): add support for llamav2 (#633 )	2023-07-18 18:09:53 +02:00
OlivierDehaene	3b71c38558	feat(server): flash attention v2 (#624 )	2023-07-18 16:21:18 +02:00
Nicolas Patry	4d38a1c4ad	feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587 ) but should work on more configurations (no need for 2 GPUs, less RAM usage). # What does this PR do? Reworking the quantization script so it's still universal (not llama specific) but should work on more configurations (no need for 2 GPUs, less RAM usage). Still need to investigate the potential differences in quantization results.	2023-07-18 12:19:05 +02:00
OlivierDehaene	a2cf1bdb2f	fix(server): empty_cache when stopped	2023-07-15 13:58:19 +02:00
OlivierDehaene	c58a0c185b	v0.9.2 (#616 )	2023-07-14 16:31:48 +02:00
OlivierDehaene	5b9de4a1d3	fix(server): blacklist local files (#609 ) Close #589 #602	2023-07-13 21:54:55 +02:00
ssmi153	3628559516	GPTQ Env vars: catch correct type of error (#596 ) # What does this PR do? When passing in environment variables like gptq_bits, we still get errors thrown from TGI because the try/catch block is catching the wrong type of error. This PR aims to fix that. @Narsil - let me know if this is how you want this formatted. My Python is a little shaky, so I hope this syntax is correct.	2023-07-12 19:57:46 +02:00
OlivierDehaene	f2f0289fb9	feat(server): empty cache on errors	2023-07-12 17:06:19 +02:00
Nicolas Patry	67347950b7	feat(server): Implements sharding for non divisible `vocab_size`. (#583 ) - The code is relatively easy (just disable the checks on Embedding and Head) This cannot be done in the same easy fashion for hidden_dim/head_dim. It's relatively easy on some models (classic MHA) but it would make the other models (MQA) much more complex, and GPTQ quantization another quite hairy piece of code.	2023-07-12 16:43:31 +02:00
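Sharding a dimension that does not divide evenly across ranks, as in the `vocab_size` commit above, can be sketched with ceiling division so that the last rank simply receives a shorter slice. The function name and the exact slicing policy are assumptions for illustration, not taken from the PR.

```python
from typing import Tuple


def shard_bounds(vocab_size: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, stop) row range of the embedding owned by `rank`."""
    # Ceiling division: every rank gets at most `block_size` rows.
    block_size = (vocab_size + world_size - 1) // world_size
    start = min(rank * block_size, vocab_size)
    stop = min(start + block_size, vocab_size)
    return start, stop
```

With `vocab_size=10` and `world_size=4`, ranks 0 to 2 each own three rows and rank 3 owns the single remaining row.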
ssmi153	2c4bf88268	fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590 ) # What does this PR do? This fixes a typo and extends the GPTQ_BITS environment variables through to the second method which requires the same logic. Please let me know if there's anything I've misunderstood in this change. Thanks @Narsil for the original fix.	2023-07-12 14:17:35 +02:00
Adam Kowalski	7f9072228a	fix(server): Adding logger import to t5_modeling.py (#585 ) Logger is referenced during the apex importing but is not imported, causing a NameError	2023-07-12 10:40:32 +02:00
Nicolas Patry	db4efbf4bc	fix(server): T5 weights names. (#582 ) Fixes #541	2023-07-12 10:01:42 +02:00
Nicolas Patry	5bd2ab6583	feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580 ) # What does this PR do? Some models are already converted, and do not have those values in the file, this enables users to use them with less friction. Went for pure env based because adding flags would end up (imo) very tedious to maintain. There's a lot of sanitation to do: those flags would be errors if not used in conjunction with `--quantize gptq`. Then the flags need to exist in the launcher and the server passing them all throughout all function calls. This PR is intended as an easy escape hatch, not the de facto method to use gptq in TGI. Fixes #500	2023-07-12 10:00:02 +02:00
Nicolas Patry	f0181436f4	fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579 ) Fixes #555	2023-07-12 09:51:34 +02:00
OlivierDehaene	b4024edd45	feat: better errors for warmup and TP (#575 ) Close #571	2023-07-10 14:47:15 +02:00
Nicolas Patry	e943a294bc	fix(server): harden the weights choice to save on disk. (#561 ) - Look at `transformers` base class to check for `_key_to_ignore_on_load_missing` or `_tied_weights` which are the standard attributes to select the keys to NOT save on disk (since they are ignored) - Modified safetensors code (to be reflected in safetensors even if it's an internal function). - Will not work for trust_remote_code=True repos (like santacoder). Should help with : https://github.com/huggingface/text-generation-inference/issues/555 and : https://github.com/huggingface/text-generation-inference/pull/501 and https://github.com/huggingface/text-generation-inference/issues/556 and https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593	2023-07-07 14:50:12 +02:00
OlivierDehaene	31b36cca21	v0.9.1 (#558 )	2023-07-06 16:05:42 +02:00
OlivierDehaene	c4bb5264ac	fix(server): decrease memory fragmentation (#557 )	2023-07-06 14:28:33 +02:00
OlivierDehaene	31e2253ae7	feat(server): use latest flash attention commit (#543 ) @njhill FYI	2023-07-04 20:23:55 +02:00
Nick Hill	e4b26aa10b	fix(server): avoid errors for very small top_p values (#544 ) See https://github.com/huggingface/transformers/pull/24111 I didn't add validation to the `__init__` method since it's not done for other values/warpers.	2023-07-04 20:11:33 +02:00
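The failure mode behind the `top_p` commit above can be illustrated with a nucleus filter that always keeps at least one token, so that a very small `top_p` cannot mask out the entire distribution. This is a pure-Python sketch under that assumption, not the patch itself; `top_p_filter` and `min_tokens_to_keep` are illustrative names.

```python
import math
from typing import List


def top_p_filter(logits: List[float], top_p: float,
                 min_tokens_to_keep: int = 1) -> List[float]:
    """Mask logits outside the nucleus with -inf, keeping at least
    `min_tokens_to_keep` of the most probable tokens."""
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least probable.
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept = set()
    cumulative = 0.0
    for rank, idx in enumerate(order):
        # Stop once the nucleus mass is reached, but never before the minimum count.
        if rank >= min_tokens_to_keep and cumulative >= top_p:
            break
        kept.add(idx)
        cumulative += probs[idx]
    return [x if i in kept else -math.inf for i, x in enumerate(logits)]
```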
Antoni Baum	2a101207d4	fix(server): Handle loading from local files for MPT (#534 ) This PR allows the MPT model to be loaded from local files. Without this change, an exception will be thrown by `hf_hub_download` function if `model_id` is a local path.	2023-07-04 18:37:25 +02:00
Antoni Baum	8405581fcd	fix: Update server/Makefile to include Makefile-vllm (#520 ) # What does this PR do? For consistency and ease of use (you can just run `make` to install vllm without any extra steps).	2023-07-04 09:39:25 +02:00
Nicolas Patry	1da07e85aa	feat(server): Add Non flash MPT. (#514 ) # What does this PR do? This adds a non flash version of MPT. Flash is harder because we need to create a bias ready cuda kernel of flash attention. Fixes https://github.com/huggingface/text-generation-inference/issues/361 Fixes https://github.com/huggingface/text-generation-inference/issues/491 Fixes https://github.com/huggingface/text-generation-inference/issues/290	2023-07-03 13:01:46 +02:00
OlivierDehaene	e28a809004	v0.9.0 (#525 )	2023-07-01 19:25:41 +02:00
Nicolas Patry	ecf6dc3a5a	feat: Add the option to force another dtype than `f16`. (#513 )	2023-06-30 20:30:09 +02:00
OlivierDehaene	e74bd41e0f	feat(server): add paged attention to flash models (#516 ) Closes #478	2023-06-30 19:09:59 +02:00
Antoni Baum	ae466a8736	fix(server): Do not init process group if already initialized (#388 )	2023-06-26 12:32:54 +02:00
Nicolas Patry	aefde28b45	feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438 ) Let's start discussing implementation. - Need to expose the quantization scripts (either included here or add doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa) - Make sure GPTQ works for multiple models (priority to Falcon). Currently it means that every place we use `get_{tensor\|sharded}` has to check for quantization. My idea is to reintegrate as much as possible into `utils/layer.py` by expanding `load_multi` to be a bit more generic. This might require some thinking, but ultimately the `qweight,qzeros,scales,g_idx` should be in a single place, and independent of bias presence. --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2023-06-26 12:27:01 +02:00
Nicolas Patry	776d150c55	feat(server): Adding new ignore_rule for conversion. (#485 )	2023-06-23 12:41:13 +02:00
Nicolas Patry	49b4b33e80	feat(server): Update convert logic. (#483 ) Should be more robust to shared tensors (ok when using `from_pretrained`), but it forces us to add new checks in our loading code (since the chosen key to keep might be different from `transformers`). --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>	2023-06-23 12:40:46 +02:00
Nicolas Patry	c9c65ab323	fix(server): Fixing T5 in case the names are mixed up. (#475 )	2023-06-20 18:03:36 +02:00
OlivierDehaene	53aa9194c8	fix(server): fix warpers on CPU (#472 ) Closes #471	2023-06-20 11:06:10 +02:00
OlivierDehaene	ece7ffa40a	feat(server): improve flash attention import errors (#465 ) @lewtun, is this enough? Closes #458 Closes #456	2023-06-19 09:53:45 +02:00
OlivierDehaene	f59fb8b630	feat(router): add ngrok integration (#453 )	2023-06-16 16:25:11 +02:00
OlivierDehaene	5ce89059f8	feat(server): pre-allocate past key values for flash causal LM (#412 )	2023-06-12 18:30:29 +02:00
OlivierDehaene	e496c9ba5b	feat(server): optimize dist ops (#434 )	2023-06-09 11:55:29 +02:00
Nicolas Patry	abd58ff82c	feat(server): Rework model loading (#344 ) # What does this PR do? Reworked the loading logic. Idea is to use cleaner loading code: - Remove need for `no_init_weights` - Remove all weird `bnb_linear` and `load_weights` and `post_load_weights`. New code layout: - New class `Weights` in charge of handling loading the weights from multiple files into appropriate tensors (potentially sharded) - TP layers now are "shells", they contain the code to know what kind of sharding we need + eventual `all_reduce`. They do not inherit from linear, but they contain some kind of Linear instead - the contained linear can be either FastLinear, BnbLinear or GPTq Linear next. - All modeling code is explicitly made for sharding, process group is just no-ops for non sharded code (removes a lot of test cases) ![Screenshot from 2023-05-19 23-19-59](https://github.com/huggingface/text-generation-inference/assets/204321/9a802654-74a3-488c-87a8-073743a6143f) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.taildb5d.ts.net> Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co> Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>	2023-06-08 14:51:52 +02:00
OlivierDehaene	6abec14a7e	feat(server): batch tokenization for flash causal lm (#411 )	2023-06-05 16:09:41 +02:00
OlivierDehaene	895c5f1562	feat(server): only compute prefill logprobs when asked (#406 ) Close #288	2023-06-02 17:12:30 +02:00
OlivierDehaene	e7248fe90e	v0.8.2	2023-06-01 19:49:13 +02:00
OlivierDehaene	95d3546976	feat(server): load santacoder/starcoder models with safetensors (#393 ) Fix #366	2023-06-01 12:10:35 +02:00
OlivierDehaene	c0928e6f26	feat(server): remove trust_remote_code requirement for falcon models (#396 )	2023-06-01 12:07:41 +02:00
OlivierDehaene	d69a0633be	fix(server): fix has_position_ids (#395 ) Fix #389	2023-06-01 11:41:35 +02:00
OlivierDehaene	db2ebe3947	v0.8.1	2023-05-31 12:08:40 +02:00
OlivierDehaene	337afb2842	fix(server): fix bnb quantization for CausalLM models (#385 )	2023-05-31 11:48:28 +02:00
OlivierDehaene	87dc034b59	feat(server): add retry on download (#384 )	2023-05-31 10:57:53 +02:00
OlivierDehaene	081b926584	v0.8.0	2023-05-30 18:39:35 +02:00
OlivierDehaene	b8b950b37c	feat(server): support RefinedWeb models (#379 )	2023-05-30 18:25:19 +02:00
OlivierDehaene	bf7f1d5434	fix(server): fix quantization	2023-05-30 13:56:03 +02:00
CL-Shang	5fde8d9991	Fix issue when load AutoModelForSeq2SeqLM model (#370 )	2023-05-26 12:31:47 +02:00
OlivierDehaene	62f91f78ac	feat(server): support vectorized warpers in flash causal lm (#317 ) Co-authored-by: Joel Lamy-Poirier <joel.lamy-poirier@servicenow.com>	2023-05-26 12:30:27 +02:00
OlivierDehaene	218c9adaa5	feat: decrease IPC proto size (#367 ) Closes #307 #308	2023-05-24 19:19:57 +02:00
OlivierDehaene	d31562f300	v0.7.0 (#353 )	2023-05-23 21:20:49 +02:00
OlivierDehaene	e3e487dc71	feat(server): support trust_remote_code (#363 )	2023-05-23 20:40:39 +02:00
OlivierDehaene	e9669a4085	feat(server): do not use device_map auto on single GPU (#362 )	2023-05-23 19:12:12 +02:00
OlivierDehaene	cfaa858070	feat(server): support fp16 for t5 (#360 ) Fixes #349	2023-05-23 18:16:48 +02:00
OlivierDehaene	94377efa78	chore(sever): update requirements (#357 ) Fixes #338	2023-05-23 18:03:22 +02:00
OlivierDehaene	4f4c9c1665	fix(server): t5 cannot run in f16 (#356 ) Fix #349	2023-05-23 12:15:54 +02:00
OlivierDehaene	91d9beec90	fix(server): fix init for flash causal lm (#352 ) Fixes #347	2023-05-22 15:05:32 +02:00
OlivierDehaene	e649bf9a55	feat(server): Support BLOOMChat-176B (#348 ) (#351 ) @njhill, temporary workaround to be able to run our CI as secrets are not available to runners run by external contributors. I will ask around to see if there is a better way. Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2023-05-22 13:36:00 +02:00
OlivierDehaene	5a58226130	fix(server): fix decode token (#334 ) Fixes #333 --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2023-05-16 23:23:27 +02:00
OlivierDehaene	e71471bec9	feat: add snapshot testing (#282 )	2023-05-15 23:36:30 +02:00
Nicolas Patry	f58f0a0364	Single place for TP layers + Dropout Layer Norm + FastLinear (#329 )	2023-05-15 17:30:47 +02:00
Nicolas Patry	d7a97aa0b6	Removing dead variables. (#327 )	2023-05-15 15:14:17 +02:00
Nicolas Patry	91e674bb85	Lifting check_unitialized. (#325 ) # What does this PR do? Lifting check_unitialized.	2023-05-15 11:32:25 +02:00
Nicolas Patry	73d84c6ee5	Hotfixes for santacoder/bigcode. (#294 ) # What does this PR do? Hotfixes: - Uses `model_type`=`gpt_bigcode` for more general usage. - Hotfixes linked lm_head vs wte_embedding (safetensors files do not contain the key, correctly when the file is sharded, whereas pytorch copies the tensor) --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2023-05-15 10:35:20 +02:00
OlivierDehaene	8a8f43410d	chore(docker): use nvidia base image (#318 )	2023-05-12 17:32:40 +02:00
Nicolas Patry	76a48cd365	feat(server): GPTQ quantization (step1) (#277 ) Changes only the type from `bool` to `Option<Enum>` pretty much everywhere. - Use `Optional[str]` in Python (easier to manage than importing type everywhere), except for the cli, to get proper validation - Updated all models to gracefully handle new values. (Error out if unknown value, or gptq since not implemented).	2023-05-12 14:46:41 +02:00
OlivierDehaene	4f6d038c0b	fix(server): fix multinomial implem in Sampling	2023-05-11 13:30:38 +02:00
OlivierDehaene	a6c18c39bb	feat(server): use cuda graph in logits warping (#302 )	2023-05-10 19:08:54 +02:00
OlivierDehaene	745f596c88	feat(server): use float16 (#304 )	2023-05-10 15:51:10 +02:00
OlivierDehaene	68e9d6ab33	feat(server): shard token decode (#303 )	2023-05-10 15:48:21 +02:00
OlivierDehaene	ad66f6ef9a	feat(server): optim flash causal lm decode_token (#285 )	2023-05-09 18:26:19 +02:00
Nicolas Patry	b4aa87db58	fea(server): decrease convert RAM requirements (#286 )	2023-05-05 17:57:02 +02:00
Nicolas Patry	690fc31757	fix(server): fix convert (#284 )	2023-05-05 15:28:08 +02:00
Nicolas Patry	f08343d44d	fix(server): Removes the parallelism in file convertion (during download) (#275 )	2023-05-04 15:22:54 +02:00
OlivierDehaene	85aa7e2e7b	feat(server): support hf endpoint weight layout (#266 )	2023-05-03 11:36:24 +02:00
OlivierDehaene	4096000e34	fix(server): fix typo in tokenizers decode (#269 ) closes #268	2023-05-03 10:10:34 +02:00

273 Commits