Commit Graph

335 Commits

Author SHA1 Message Date
OlivierDehaene bf94df3c71
fix(server): use mem_get_info to get kv cache size (#664)
Close
https://github.com/huggingface/text-generation-inference/issues/649
Close
https://github.com/huggingface/text-generation-inference/issues/651
Close
https://github.com/huggingface/text-generation-inference/issues/653
Close #636
2023-07-20 17:23:49 +02:00
Nicolas Patry 08b8eec1d7
fix(server): Fixing non parameters in quantize script `bigcode/starcoder` was an example. (#661) 2023-07-20 16:04:15 +02:00
fxmarty 362883f259
fix(server): llama v2 GPTQ (#648)
As per title & reported
https://github.com/huggingface/text-generation-inference/issues/601#issuecomment-1641435956
https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ/discussions/5

Test it:

```
GPTQ_BITS=4 GPTQ_GROUPSIZE=1 text-generation-launcher --model-id TheBloke/Llama-2-70B-chat-GPTQ --port 8080 --num-shard 4 --quantize gptq
```
&
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"hey llama","parameters":{"max_new_tokens":256}}' \
    -H 'Content-Type: application/json'
```
2023-07-20 15:02:54 +02:00
cdawg 214c06f510
Add trust_remote_code to quantize script (#647)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes a bug appeared with MR #587 fixing issue #552.
See the discussion in #552.

With MR #587 the trust_remote_code variable is not passed to
AutoModelForCausalLM, but is found in the function signature. This
prevents models like falcon to be quantized, because trust_remote_code
is required. This MR fixes the issue.


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [X] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [X] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.
@Narsil
<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @

 -->
2023-07-20 13:53:08 +02:00
Nicolas Patry 5a1512c025
docs: Update README.md (#643) 2023-07-19 13:39:12 +02:00
Nicolas Patry 1c81df15cd
docs: Update README.md (#639) 2023-07-19 13:38:52 +02:00
OlivierDehaene b66b190403
feat(router): ngrok edge (#642) 2023-07-19 11:59:58 +02:00
OlivierDehaene fe80f5360c
feat(server): auto max_batch_total_tokens for flash att models (#630) 2023-07-19 09:31:25 +02:00
OlivierDehaene 5e6ddfd6a4
fix(server): fix llamav2 config (#635) 2023-07-18 18:49:42 +02:00
OlivierDehaene cf83f9b66f
v0.9.3 (#634) 2023-07-18 18:11:20 +02:00
Nicolas Patry 211b211ec0
feat(server): add support for llamav2 (#633) 2023-07-18 18:09:53 +02:00
OlivierDehaene 3b71c38558
feat(server): flash attention v2 (#624) 2023-07-18 16:21:18 +02:00
Nicolas Patry 4d38a1c4ad
feat(server): Reworking the quantization script so it's still universal (not llama specific) (#587)
but should work on more configurations (no need for 2 GPUs, less RAM
usage).


# What does this PR do?

Reworking the quantization script so it's still universal (not llama
specific)

but should work on more configurations (no need for 2 GPUs, less RAM
usage).

Still need to investigate the potential differences in quantization
results.


<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2023-07-18 12:19:05 +02:00
OlivierDehaene 44acf72a73
fea(launcher): debug logs (#623) 2023-07-17 19:03:07 +02:00
Nicolas Patry bc2873246c
fix(launcher): Rename `b-float16` to `bfloat16` in the launcher arg (#621) 2023-07-17 18:38:16 +02:00
OlivierDehaene a2cf1bdb2f fix(server): empty_cache when stopped 2023-07-15 13:58:19 +02:00
OlivierDehaene c58a0c185b
v0.9.2 (#616) 2023-07-14 16:31:48 +02:00
OlivierDehaene 5b9de4a1d3
fix(server): blacklist local files (#609)
Close #589 #602
2023-07-13 21:54:55 +02:00
Victor Muštar c8b077be79
docs: README: Add logo + baseline (#611)
![image](https://github.com/huggingface/text-generation-inference/assets/3841370/58177321-479f-4ad1-b3bc-cec027423984)
2023-07-13 21:45:20 +02:00
OlivierDehaene 982ce3227b
feat(router): explicit warning if revision is not set (#608) 2023-07-13 18:59:38 +02:00
OlivierDehaene b7327205a6
feat(launcher): add arg validation and drop subprocess (#595) 2023-07-13 14:22:37 +02:00
ssmi153 3628559516
GPTQ Env vars: catch correct type of error (#596)
# What does this PR do?

When passing in environment variables like gptq_bits, we still get
errors thrown from TGI because the try/catch block is catching the wrong
type of error. This PR aims to fix that.

@Narsil - let me know if this is how you want this formatted. My Python
is a little shaky, so I hope this syntax is correct.
2023-07-12 19:57:46 +02:00
OlivierDehaene f2f0289fb9 feat(server): empty cache on errors 2023-07-12 17:06:19 +02:00
Nicolas Patry 67347950b7
feat(server): Implements sharding for non divisible `vocab_size`. (#583)
- The code is relatively easy (just disable the checks on Embedding and
Head)

This cannot be done in the same easy fashion for hidden_dim/head_dim.
It's relatively easy on some models (classic MHA) but it would make the
other
models (MQA) much more complex, and GPTQ quantization another quite
hairy piece
of code.
2023-07-12 16:43:31 +02:00
ssmi153 2c4bf88268
fix(server): Bug fixes for GPTQ_BITS environment variable passthrough (#590)
# What does this PR do?

This fixes a typo and extends the GPTP_BITS environment variables
through to the second method which requires the same logic. Please let
me know if there's anything I've misunderstood in this change.

Thanks @Narsil for the original fix.
2023-07-12 14:17:35 +02:00
Adam Kowalski 7f9072228a
fix(server): Adding logger import to t5_modeling.py (#585)
Logger is referenced during the apex importing but is not imported,
causing a NameError
2023-07-12 10:40:32 +02:00
Nicolas Patry db4efbf4bc
fix(server): T5 weights names. (#582)
Fixes #541
2023-07-12 10:01:42 +02:00
Nicolas Patry f063ebde10
chore: migrate ci region for more availability. (#581) 2023-07-12 10:01:01 +02:00
Nicolas Patry 5bd2ab6583
feat(server): Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE. (#580)
# What does this PR do?

Some models are already converted, and do not have those values in the
file, this enables users to use them with less friction.

Went for pure env based because adding flags would end up (imo) very
tedious to maintain. There's a lot of sanitation to do: those flags
would be errors if not used in conjuction with `--quantize gptq`.
Then the flags need to exist in the launcher and the server passing them
all throughout all function calls.

This PR is intended as an easy escape hatch, not the defacto method to
use gptq in TGI.

Fixes #500
2023-07-12 10:00:02 +02:00
Nicolas Patry f0181436f4
fix(server): Fixing RW code (it's remote code so the Arch checking doesn't work to see which weights to keep). (#579)
Fixes #555
2023-07-12 09:51:34 +02:00
OlivierDehaene b4024edd45
feat: better errors for warmup and TP (#575)
Close #571
2023-07-10 14:47:15 +02:00
Nicolas Patry e943a294bc
fix(server): harden the weights choice to save on disk. (#561)
- Look at `transformers` base class to check for
  `_key_to_ignore_on_load_missing` or `_tied_weights` which are the
  standard attributes to select the keys to NOT save on disk (since they
  are ignored)

- Modified safetensors code (to be reflected in safetensors even if it's
  an internal function).
  
- Will not work for trust_remote_code=True repos (like santacoder).

Should help with :
https://github.com/huggingface/text-generation-inference/issues/555
and : https://github.com/huggingface/text-generation-inference/pull/501
and https://github.com/huggingface/text-generation-inference/issues/556
and
https://github.com/huggingface/text-generation-inference/issues/482#issuecomment-1623713593
2023-07-07 14:50:12 +02:00
OlivierDehaene 31b36cca21
v0.9.1 (#558) 2023-07-06 16:05:42 +02:00
OlivierDehaene c4bb5264ac
fix(server): decrease memory fragmentation (#557) 2023-07-06 14:28:33 +02:00
OlivierDehaene 6f42942772
feat(router): add argument for hostname in router (#545) (#550)
# What does this PR do?

In title. Adds argument `--hostname` in router to support something like
`--hostname ::`. Tested with

```commandline
cargo run -- --port 8080 --hostname ::
curl -I -X GET 'http://[::1]:8080/health'  # failed before this commit
```

Trigger CI

---------

Co-authored-by: Phil Chen <philchen2000@gmail.com>
2023-07-05 18:28:45 +02:00
OlivierDehaene 31e2253ae7
feat(server): use latest flash attention commit (#543)
@njhill FYI
2023-07-04 20:23:55 +02:00
Nick Hill e4b26aa10b
fix(server): avoid errors for very small top_p values (#544)
See https://github.com/huggingface/transformers/pull/24111

I didn't add validation to the `__init__` method since it's not done for
other values/warpers.
2023-07-04 20:11:33 +02:00
Antoni Baum 2a101207d4
fix(server): Handle loading from local files for MPT (#534)
This PR allows the MPT model to be loaded from local files. Without this
change, an exception will be thrown by `hf_hub_download` function if
`model_id` is a local path.
2023-07-04 18:37:25 +02:00
Nicolas Patry e6888d0e87
docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462) 2023-07-04 18:35:37 +02:00
Antoni Baum 8405581fcd
fix: Update server/Makefile to include Makefile-vllm (#520)
# What does this PR do?

For consistency and ease of use (you can just run `make` to install vllm
without any extra steps).

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->
2023-07-04 09:39:25 +02:00
Nicolas Patry 1da07e85aa
feat(server): Add Non flash MPT. (#514)
# What does this PR do?


This adds a non flash version of MPT.
Flash is harder because we need to create a bias ready cuda kernel of
flash attention.

Fixes
https://github.com/huggingface/text-generation-inference/issues/361
Fixes
https://github.com/huggingface/text-generation-inference/issues/491
Fixes
https://github.com/huggingface/text-generation-inference/issues/290
2023-07-03 13:01:46 +02:00
OlivierDehaene e28a809004
v0.9.0 (#525) 2023-07-01 19:25:41 +02:00
OlivierDehaene 2b53d71991
fix(launcher): fix issue where launcher does not properly report shard failures (#522) 2023-06-30 23:09:20 +02:00
Nicolas Patry ecf6dc3a5a
feat: Add the option to force another dtype than `f16`. (#513) 2023-06-30 20:30:09 +02:00
OlivierDehaene 3b0c979efc
feat(router): arg validation (#519) 2023-06-30 20:07:49 +02:00
OlivierDehaene e74bd41e0f
feat(server): add paged attention to flash models (#516)
Closes #478
2023-06-30 19:09:59 +02:00
Robert Kimball 70f485bf9f
feat(router): add header option to disable buffering for the generate_stream response (#498)
# This PR adds an http header option to disable buffering for the
generate_stream endpoint response stream.

Problem: If a model is run behind a proxy server such as nginx that has
buffering enabled then the response stream from generate_stream gets
aggregated into a single response which basically disables streaming.
Instead of getting a chunked response where each token is presented over
time the response presents everything all at once.

Solution: This change adds the `X-Accel-Buffering` http header which
disables buffering for the generate_stream response, allowing the
response to stream properly.
2023-06-28 11:50:12 +02:00
Antoni Baum ae466a8736
fix(server): Do not init process group if already initialized (#388) 2023-06-26 12:32:54 +02:00
Nicolas Patry aefde28b45
feat(server): Add inference support for GPTQ (llama + falcon tested) + Quantization script (#438)
Let's start discussing implementation.

- Need to expose the quantization scripts (either included here or add
doc on how to use https://github.com/qwopqwop200/GPTQ-for-LLaMa)
- Make sure GPTQ works for multiple models (priority to Falcon).

Currently it means that every place we use `get_{tensor|sharded}` to
check for quantization.

My idea is to reintegrate as much as possible into `utils/layer.py` by
expanding `load_multi` to be a bit more generic.
This might require some thinking, but ultimately the
`qweight,qzeros,scales,g_idx` should be in a single place, and
independant of bias presence.

# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

Fixes # (issue)


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-41-161.ec2.internal>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
2023-06-26 12:27:01 +02:00
OlivierDehaene bd3a9d8e85
fix(router): add timeout on flume sends (#488) 2023-06-23 14:58:28 +02:00