Commit Graph

855 Commits

Author SHA1 Message Date
Felix Marty c2f4b7f93e add vicuna 2024-07-02 08:25:12 +00:00
Felix Marty e0bfe4e7f0 fix 2024-07-01 12:31:56 +00:00
Felix Marty 750ef7bc23 Merge branch 'ci_amd3' of github.com:huggingface/text-generation-inference into ci_amd3 2024-07-01 12:20:40 +00:00
Felix Marty 00cc73b7b7 fix post merge 2024-07-01 12:20:29 +00:00
fxmarty 59849777de Merge branch 'main' into ci_amd3 2024-07-01 14:14:46 +02:00
Felix Marty 9fd395fae4 fix tests 2024-07-01 12:12:26 +00:00
Daniël de Kok 2ce8019480
Use GPTQ-Marlin for supported GPTQ configurations (#2111)
GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
let's use it by default if the kernels are installed, the GPU supports
it, and the kernels support the configuration.

For models generated by `text-generation-server quantize`, use
`sym=False`. This subcommand has performed asymmetric quantization since
the beginning, and incorrectly reporting the model as symmetric would
select GPTQ-Marlin (which does not support asymmetric quantization).
2024-07-01 12:59:12 +02:00
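A minimal sketch of the default-selection rule this commit message describes; the function and parameter names below are illustrative, not TGI's actual API:

```python
def should_use_gptq_marlin(
    quantize: str,
    sym: bool,
    bits: int,
    marlin_kernels_installed: bool,
    gpu_supported: bool,
) -> bool:
    # Illustrative only: pick GPTQ-Marlin by default when the kernels are
    # installed, the GPU supports them, and the GPTQ configuration is one
    # the kernels can handle.
    return (
        quantize == "gptq"
        and marlin_kernels_installed
        and gpu_supported
        and sym  # GPTQ-Marlin does not support asymmetric quantization
        and bits in (4, 8)  # assumed supported bit widths
    )
```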
drbh 0d97a93c1e
feat: download lora adapter weights from launcher (#2140) 2024-07-01 12:58:49 +02:00
drbh 25f57e2e98
fix: use weights from base_layer (#2141) 2024-07-01 12:58:40 +02:00
Nicolas Patry b4552f9de9
Fixing clippy. (#2149) 2024-07-01 12:02:19 +02:00
Wang, Yi 6ea570ddfe
fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148)
* fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices]

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-07-01 11:27:53 +02:00
Felix Marty 05d1011b4f fix xpu build 2024-06-28 16:08:27 +00:00
Felix Marty 68583d3240 working memory leak fix in tunableop 2024-06-28 15:15:12 +00:00
Felix Marty 3d50ff71b7 bump torch to more recent version 2024-06-28 13:10:43 +00:00
Felix Marty 87db820627 fix rm 2024-06-28 09:49:20 +00:00
Nicolas Patry fb98ab273f
Fixing the CI to also run in release when it's a tag? (#2138) 2024-06-28 09:31:09 +02:00
drbh 74b0231b19
fix: refactor post_processor logic and add test (#2137)
* fix: refactor post_processor logic and add test

* fix: remove dev comment

* fix: adjust when post_processor is overridden and improve create_post_processor
2024-06-27 23:16:19 +02:00
Felix Marty eaa6890b3c remove hidden 2024-06-27 15:24:14 +00:00
Felix Marty 0a5485d8a0 avoid permissions issues 2024-06-27 14:51:11 +00:00
Nicolas Patry 3ea8259af1
Fixing gemma2. (#2135)
* Fixing gemma2.

* Adding new model.
2024-06-27 16:04:20 +02:00
Nicolas Patry 0e4ab6d31c
Fixing malformed rust tokenizers (#2134)
* Fixing malformed rust tokenizers

* Fix for deepseek too.
2024-06-27 16:04:03 +02:00
Daniël de Kok dd2d91b043
Idefics2: sync added image tokens with transformers (#2080)
Before this change, the number of reserved image tokens was not the
same as the number of images. Fixes #2029.

While at it, also remove all the image token handling duplication
in `prepare_input`.
2024-06-27 15:54:35 +02:00
Felix Marty bbc949ff74 trigger ci 2024-06-27 13:47:21 +00:00
Nicolas Patry b53b21c63a
Bumping to 2.1 (#2131) 2024-06-27 12:34:43 +02:00
Nicolas Patry bcfcd4740a
Fixing prom leak by upgrading. (#2129) 2024-06-27 08:08:43 +02:00
Felix Marty 60a96a9ae3 do not use private registry in cleanup cache step 2024-06-26 13:57:05 +00:00
Felix Marty 4067fc8211 login to registry 2024-06-26 10:58:52 +00:00
Felix Marty 2330052aa2 debug 2024-06-26 10:43:57 +00:00
fxmarty 227f78f3fe Merge branch 'main' into ci_amd3 2024-06-26 12:08:42 +02:00
Felix Marty b44097a61b fix cache cleanup 2024-06-26 10:02:45 +00:00
drbh be2d38032a
fix: simplify kserve endpoint and fix imports (#2119) 2024-06-25 19:30:10 -04:00
Daniël de Kok f1f98e369f
Add support for Marlin 2:4 sparsity (#2102)
This change adds support for 2:4 sparsity when using Marlin
quantization. The 2:4 kernel is used when:

* The quantizer is `marlin`;
* the quantizer checkpoint format is `marlin_24`.

Fixes #2098.
2024-06-25 21:09:42 +02:00
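As a sketch, the selection rule quoted above reduces to a simple predicate (names illustrative):

```python
def use_marlin_24(quantizer: str, checkpoint_format: str) -> bool:
    # The 2:4 sparse kernel is chosen only when both conditions from the
    # commit message hold.
    return quantizer == "marlin" and checkpoint_format == "marlin_24"
```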
Daniël de Kok 14980df2df
Support AWQ quantization with bias (#2117)
When the AWQ quantizer was used with a layer that uses a bias,
the bias tensor was not correctly passed/used. Instead, the
value `true`/`1.0` was added to the linear transformation.

Correctly pass through the bias when it is not `None`.

Fixes #2106.
2024-06-25 21:09:00 +02:00
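A runnable sketch of the fix, with a plain matmul standing in for the real AWQ kernel, since the point is only the bias handling:

```python
import torch

def linear_with_optional_bias(x: torch.Tensor, weight: torch.Tensor,
                              bias: torch.Tensor | None) -> torch.Tensor:
    # Stand-in for the fused AWQ matmul; a plain matmul keeps this runnable.
    out = x @ weight.t()
    # The fix described above: pass the bias tensor through only when it
    # is not None, rather than folding a `True`/`1.0` into the output.
    if bias is not None:
        out = out + bias
    return out

x = torch.randn(2, 8)
w = torch.randn(4, 8)
b = torch.randn(4)
print(linear_with_optional_bias(x, w, b).shape)  # torch.Size([2, 4])
```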
drbh 04e1af94d7
Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: prefer lorax's custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-06-25 14:46:27 -04:00
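A hedged usage sketch based on the lora docs this PR adds; the `LORA_ADAPTERS` launcher variable and the `adapter_id` request parameter are taken from those docs and may differ across versions:

```python
import requests

# Assumed server startup, following the new lora docs:
#   LORA_ADAPTERS=predibase/customer_support \
#   text-generation-launcher --model-id mistralai/Mistral-7B-v0.1
resp = requests.post(
    "http://127.0.0.1:3000/generate",
    json={
        "inputs": "Hello, who are you?",
        "parameters": {
            "max_new_tokens": 40,
            # Select one of the loaded adapters per request; omit it to
            # generate with the base model.
            "adapter_id": "predibase/customer_support",
        },
    },
)
print(resp.json())
```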
Nicolas Patry a2a97b05d6
Fix CI. (#2118)
Fix clippy.
2024-06-25 17:53:36 +02:00
Daniël de Kok fc9c3153e5
Add pytest release marker (#2114)
* Add pytest release marker

Annotate a test with `@pytest.mark.release` and it only gets run
with `pytest integration-tests --release`.

* Mark many models as `release` to speed up CI
2024-06-25 16:53:20 +02:00
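For example (the `--release` flag is provided by the test suite's own conftest, not by stock pytest):

```python
import pytest

@pytest.mark.release
def test_large_model_generation():
    # Skipped by default; runs only under:
    #   pytest integration-tests --release
    assert True
```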
Wang, Yi e563983d90
fix cpu and xpu issue (#2116)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-06-25 16:47:06 +02:00
Nicolas Patry 9e2fdf57c0
Removing IPEX_AVAIL. (#2115)
* Removing IPEX_AVAIL.

Chose to unify CPU and XPU under `ipex`. Most code is identical except
for a very few spots.

The biggest differences are the kv-cache layout and the flash_xxx.py
files. Since those files should soon be removed and factored away, we
should not need them.

* Forgot a few places.

* Unrelated change.

* Fixing HF_TOKEN.

* HF_TOKEN
2024-06-25 13:20:57 +02:00
drbh 3f3b7ffd67
feat: add simple tests for weights (#2092)
* feat: add simple tests for weights

* fix: adjust types and add tests

* fix: adjust so all tests pass

* feat: improve weight tests

* fix: add missing tests and renames

* fix: tweak shapes
2024-06-25 12:22:59 +02:00
Wang, Yi b64c70c9e7
Cpu tgi (#1936)
* add CPU tgi support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
2024-06-25 12:21:29 +02:00
Felix Marty 04298e5799 add back credentials 2024-06-25 09:22:49 +00:00
fxmarty dc53846456 Merge branch 'main' into ci_amd3 2024-06-25 11:20:00 +02:00
sunxichen b69f078041
fix ChatCompletion and ChatCompletionChunk object strings not compatible with the standard OpenAI API (#2089)
Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
2024-06-25 10:59:50 +02:00
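With the object strings fixed, a stock OpenAI client can consume TGI's chat endpoint; a sketch, where the `/v1` base URL and the `"tgi"` model placeholder follow TGI's documented usage:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:3000/v1", api_key="-")
chat = client.chat.completions.create(
    model="tgi",  # documented placeholder; TGI serves a single model
    messages=[{"role": "user", "content": "Hello!"}],
)
# After this fix the object string matches the OpenAI spec.
print(chat.object)  # "chat.completion"
```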
Wang, Yi 83634dc122
use xpu-smi to dump used memory (#2047)
* use xpu-smi to dump used memory
xpu use "ZE_AFFINITY_MASK" to control card, usage is like CUDA_VISIBLE_DEVICES

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Update server/text_generation_server/utils/import_utils.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-06-25 10:15:46 +02:00
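A sketch of the ZE_AFFINITY_MASK usage mentioned above, mirroring how CUDA_VISIBLE_DEVICES is used on NVIDIA; the launcher invocation is illustrative:

```python
import os
import subprocess

# Restrict the process to XPU card 0, analogous to CUDA_VISIBLE_DEVICES=0.
env = dict(os.environ, ZE_AFFINITY_MASK="0")
subprocess.run(
    ["text-generation-launcher", "--model-id", "bigscience/bloom-560m"],
    env=env,
    check=True,
)
```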
Jeff 5b2155b0f8
corrected Pydantic warning. (#2095)
* corrected Pydantic warning.

* Update clients/python/text_generation/types.py

Co-authored-by: Daniël de Kok <me@github.danieldk.eu>

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-06-25 10:10:32 +02:00
KevinDuffy94 1869ee2f57
Add OTLP Service Name Environment Variable (#2076)
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option
2024-06-25 09:33:01 +02:00
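A usage sketch; the `OTLP_SERVICE_NAME` variable name is assumed from this PR's title and should be verified against `text-generation-launcher --help`:

```python
import os
import subprocess

# Assumed: OTLP_SERVICE_NAME labels the traces TGI exports over OTLP.
env = dict(os.environ, OTLP_SERVICE_NAME="tgi-inference")
subprocess.run(
    [
        "text-generation-launcher",
        "--model-id", "bigscience/bloom-560m",
        "--otlp-endpoint", "http://otel-collector:4317",
    ],
    env=env,
    check=True,
)
```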
Lucain 3447c722fd
Support `HF_TOKEN` environment variable (#2066)
* Support HF_TOKEN environment variable

* Load test.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-06-25 09:23:12 +02:00
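A sketch of the compatibility behavior implied by the title: prefer the new `HF_TOKEN` name, falling back to the older `HUGGING_FACE_HUB_TOKEN` (the exact fallback order here is an assumption):

```python
import os

def hub_token() -> str | None:
    # Read the new name first, then the legacy one.
    return os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
```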
Felix Marty 09a41f2c43
do not skip workflow on cuda, fix no space left on device 2024-06-24 18:51:59 +02:00
Felix Marty f16f0ad92b
do not login to internal registry 2024-06-24 18:51:58 +02:00
Felix Marty 13bbf6cc5c
does ci pass without tailscale? 2024-06-24 18:51:33 +02:00