Fix exllama wrongfully loading (#990)

# What does this PR do?

The [changes](https://github.com/huggingface/text-generation-inference/pull/986/files#diff-b72e45030214e50c8ff6e3be837057b3f3368b9779fd942ca680f949fe069eafR176) that disabled exllama on old compute had an unintended consequence: `use_exllama` was not set to `False` when `HAS_EXLLAMA` is `False` **and** `CAN_EXLLAMA` is `False`. This PR fixes that.

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?

## Who can review?

@OlivierDehaene @Narsil

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
2023-09-07 03:17:22 -04:00 · 2023-09-07 03:17:22 -04:00 · 935a77fb74
parent a9fdfb2464
commit 935a77fb74
1 changed file with 5 additions and 4 deletions
--- a/server/text_generation_server/utils/weights.py
+++ b/server/text_generation_server/utils/weights.py
@@ -173,10 +173,11 @@ class Weights:
            from text_generation_server.utils.layers import HAS_EXLLAMA, CAN_EXLLAMA

            if use_exllama:
-                if not HAS_EXLLAMA and CAN_EXLLAMA:
-                    logger.warning(
-                        "Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True"
-                    )
+                if not HAS_EXLLAMA:
+                    if CAN_EXLLAMA:
+                        logger.warning(
+                            "Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True"
+                        )
                    use_exllama = False
                else:
                    logger.info("Using exllama kernels")
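
The behavioral difference in the hunk above can be checked in isolation. The following is a minimal sketch, not code from the repository: the function names `resolve_before` and `resolve_after` are hypothetical, the module-level constants are passed in as plain booleans, and the standard `logging` module stands in for whatever logger the project uses.

```python
import logging

logger = logging.getLogger("exllama_sketch")

WARNING_MSG = (
    "Exllama GPTQ cuda kernels (which are faster) could have been used, "
    "but are not currently installed, try using BUILD_EXTENSIONS=True"
)


def resolve_before(use_exllama: bool, has_exllama: bool, can_exllama: bool) -> bool:
    """Hypothetical stand-in for the logic before this commit."""
    if use_exllama:
        # `not A and B` parses as `(not A) and B`, so the fallback only
        # happens when the hardware supports exllama but it is not installed.
        if not has_exllama and can_exllama:
            logger.warning(WARNING_MSG)
            use_exllama = False
        else:
            logger.info("Using exllama kernels")
    return use_exllama


def resolve_after(use_exllama: bool, has_exllama: bool, can_exllama: bool) -> bool:
    """Hypothetical stand-in for the logic after this commit."""
    if use_exllama:
        if not has_exllama:
            # Warn only when the kernels could have been used,
            # but always fall back when they are unavailable.
            if can_exllama:
                logger.warning(WARNING_MSG)
            use_exllama = False
        else:
            logger.info("Using exllama kernels")
    return use_exllama


if __name__ == "__main__":
    # The case described in the PR: kernels neither installed nor supported.
    print(resolve_before(True, has_exllama=False, can_exllama=False))
    print(resolve_after(True, has_exllama=False, can_exllama=False))
```

In this sketch, the two functions disagree only when `use_exllama` is requested while both `has_exllama` and `can_exllama` are `False`: the old logic leaves the flag set, the new logic clears it without emitting the warning.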