Fix exllama wrongfully loading (#990)

# What does this PR do?
The
[changes](https://github.com/huggingface/text-generation-inference/pull/986/files#diff-b72e45030214e50c8ff6e3be837057b3f3368b9779fd942ca680f949fe069eafR176)
disabling exllama on old compute had the unintended consequence of not
setting `use_exllama` to `False` when `HAS_EXLLAMA` is `False` **and**
`CAN_EXLLAMA` is `False`. This PR fixes that.
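
For illustration, here is a minimal standalone sketch of the control flow before and after the fix. The flag values, the plain `logging` logger, and the flat layout are assumptions for the example; in the real code this logic lives inside a `Weights` method, as the diff below shows.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed scenario: exllama kernels are not installed and cannot be
# installed, e.g. an old GPU compute capability.
HAS_EXLLAMA = False
CAN_EXLLAMA = False

# Before (#986): the guard also required CAN_EXLLAMA, so with both
# flags False the whole branch was skipped and use_exllama stayed True.
use_exllama = True
if use_exllama:
    if not HAS_EXLLAMA and CAN_EXLLAMA:
        logger.warning("Exllama kernels could have been used")
        use_exllama = False
    else:
        logger.info("Using exllama kernels")  # wrongly logged here
assert use_exllama  # bug: the exllama path would still be taken

# After (this PR): missing kernels always disable exllama; the warning
# is only emitted when installing them would actually have helped.
use_exllama = True
if use_exllama:
    if not HAS_EXLLAMA:
        if CAN_EXLLAMA:
            logger.warning("Exllama kernels could have been used")
        use_exllama = False
    else:
        logger.info("Using exllama kernels")
assert not use_exllama  # fixed
```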

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [X] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?
@OlivierDehaene @Narsil 

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

Commit 935a77fb74 (parent a9fdfb2464) by Maxime Laboissonnière, committed by GitHub, 2023-09-07 03:17:22 -04:00.
1 changed file with 5 additions and 4 deletions

```diff
@@ -173,10 +173,11 @@ class Weights:
             from text_generation_server.utils.layers import HAS_EXLLAMA, CAN_EXLLAMA

             if use_exllama:
-                if not HAS_EXLLAMA and CAN_EXLLAMA:
-                    logger.warning(
-                        "Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True"
-                    )
+                if not HAS_EXLLAMA:
+                    if CAN_EXLLAMA:
+                        logger.warning(
+                            "Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True"
+                        )
                     use_exllama = False
                 else:
                     logger.info("Using exllama kernels")
```