Jeff
5b2155b0f8
corrected Pydantic warning. ( #2095 )
...
* corrected Pydantic warning.
* Update clients/python/text_generation/types.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
2024-06-25 10:10:32 +02:00
KevinDuffy94
1869ee2f57
Add OTLP Service Name Environment Variable ( #2076 )
...
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
* Update Docs
* Update README.md
* Update Launcher Docs
* Update Launcher Docs
Removing Option
2024-06-25 09:33:01 +02:00
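The service-name entry above can be sketched as a small lookup; the variable name and default below are assumptions for illustration, not the exact values the launcher documents:

```python
import os

# Hedged sketch (env var name and default are assumptions): read the OTLP
# service name from the environment so exported traces are tagged per
# deployment instead of using a hard-coded resource name.
def otlp_service_name(default: str = "text-generation-inference") -> str:
    return os.environ.get("OTLP_SERVICE_NAME", default)
```

If the variable is unset, the default is used, so existing deployments keep their old trace labels.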
Lucain
3447c722fd
Support `HF_TOKEN` environment variable ( #2066 )
...
* Support HF_TOKEN environment variable
* Load test.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-06-25 09:23:12 +02:00
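The `HF_TOKEN` support above amounts to a fallback chain; a minimal sketch, mirroring how `huggingface_hub` prefers the newer variable over the legacy one:

```python
import os

# Hedged sketch: prefer the newer HF_TOKEN variable, fall back to the older
# HUGGING_FACE_HUB_TOKEN name so existing setups keep working.
def get_hf_token():
    return os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
```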
Felix Marty
09a41f2c43
do not skip workflow on cuda, fix no space left on device
2024-06-24 18:51:59 +02:00
Felix Marty
f16f0ad92b
do not login to internal registry
2024-06-24 18:51:58 +02:00
Felix Marty
13bbf6cc5c
does ci pass without tailscale?
2024-06-24 18:51:33 +02:00
Felix Marty
ee62872d66
test tailscale independently
2024-06-24 18:51:33 +02:00
Felix Marty
1bb1a344d7
retry
2024-06-24 18:51:33 +02:00
Felix Marty
bc2b9b20e2
trigger ci
2024-06-24 18:51:32 +02:00
Felix Marty
3464d60d4b
The handshake operation timed out & hanging
2024-06-24 18:51:32 +02:00
Felix Marty
284894303a
remove require_backend decorators on handles; for some reason it fails in GitHub Actions
2024-06-24 18:51:32 +02:00
Felix Marty
7e0f4f25c7
renamed file
2024-06-24 18:51:32 +02:00
Felix Marty
393234de9b
hopefully fix ci
2024-06-24 18:51:32 +02:00
Felix Marty
67999773f3
fix workflow
2024-06-24 18:51:32 +02:00
Felix Marty
5fb8c275c3
fix style & typo
2024-06-24 18:51:30 +02:00
Felix Marty
e62ac4d63a
trigger
2024-06-24 18:51:09 +02:00
fxmarty
df7bb11793
dial tcp: lookup registry-1.docker.io: i/o timeout
2024-06-24 18:51:08 +02:00
fxmarty
40b342a12e
fix space
2024-06-24 18:51:08 +02:00
fxmarty
3de8f3647b
fix decorators
2024-06-24 18:51:08 +02:00
fxmarty
4616c62914
style
2024-06-24 18:51:08 +02:00
Felix Marty
5b6b257756
fix gpt2 tests - some weights were not contiguous
2024-06-24 18:51:08 +02:00
Felix Marty
9e50c117bc
fix idefics2 tests
2024-06-24 18:51:06 +02:00
fxmarty
1846c1c210
fix tests
2024-06-24 18:50:18 +02:00
fxmarty
1e10597d0c
update
2024-06-24 18:50:17 +02:00
fxmarty
406885638b
skip exl2 tests on rocm
2024-06-24 18:49:45 +02:00
fxmarty
5a4b798f98
fix gptq tests, LLMM1 matrix bound
2024-06-24 18:49:45 +02:00
fxmarty
49db30a137
disable marlin tests on rocm/xpu
2024-06-24 18:49:37 +02:00
ur4t
405765b18c
Fix cargo-chef prepare ( #2101 )
...
* Fix cargo-chef prepare
In prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly.
If Cargo.lock is not present, cargo-chef will generate a new one first, which
might vary a lot and invalidate docker build caches.
* Fix Dockerfile_amd and Dockerfile_intel
2024-06-24 18:16:36 +02:00
Nicolas Patry
480d3b3304
New runner. Manual squash. ( #2110 )
...
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host ?
* Moving buildx install after tailscale ?
* 1.79
2024-06-24 18:08:34 +02:00
drbh
811a9381b1
feat: sort cuda graphs in descending order ( #2104 )
2024-06-21 14:28:26 -04:00
Daniël de Kok
197c47a302
Fix `text-generation-server quantize` ( #2103 )
...
The subcommand did not work due to some broken imports.
2024-06-21 15:28:51 +02:00
Daniël de Kok
bcb3faa1c2
Factor out sharding of packed tensors ( #2059 )
...
For Phi-3-Small I need to shard a packed QKV bias tensor, for which
I implemented the `Weights.get_packed_sharded` method. However, this
method can also replace the `Weights._get_qweight` method and the
custom sharding code from `Weights.get_weights_col_packed`.
2024-06-20 09:56:04 +02:00
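The idea behind `Weights.get_packed_sharded` can be sketched with plain lists instead of real tensors (a simplified illustration, not the actual implementation): a packed QKV tensor concatenates Q, K and V along the output dimension, so each packed block must be sharded separately and the per-rank pieces re-concatenated.

```python
# Hedged sketch: shard a tensor that packs `num_blocks` equal sub-tensors
# (e.g. Q, K, V) along one dimension. Each block is split across ranks and
# the rank-local slices are concatenated back together.
def shard_packed(packed, num_blocks, world_size, rank):
    block_size = len(packed) // num_blocks
    shard_size = block_size // world_size
    shard = []
    for b in range(num_blocks):
        start = b * block_size + rank * shard_size
        shard.extend(packed[start:start + shard_size])
    return shard
```

Naively slicing the packed tensor as one contiguous chunk would give one rank all of Q and none of V, which is why the per-block split is needed.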
Daniël de Kok
f5a9837592
Support exl2-quantized Qwen2 models ( #2085 )
...
Fixes #2081 .
2024-06-20 07:56:16 +02:00
drbh
cdbf802860
feat: rotate tests ci token ( #2091 )
2024-06-19 17:02:58 -04:00
Daniël de Kok
11ea9ce002
CI: pass pre-commit hooks again ( #2084 )
2024-06-18 09:38:21 +02:00
Guillaume LEGENDRE
4f25c67d63
CI: Tailscale improvements ( #2079 )
...
* test local tailscale
* Update build.yaml
* Update build.yaml
* Update build.yaml
* Update build.yaml
* wait for ssh
* network host
* change step order
2024-06-18 09:13:04 +02:00
Daniël de Kok
c8c7ccd31e
Set maximum grpc message receive size to 2GiB ( #2075 )
...
* Set maximum grpc message receive size to 2GiB
The previous default was 4MiB, which doesn't really work well for
multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
2024-06-17 16:40:44 +02:00
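Raising the gRPC limit as described above is done through channel/server options; a minimal sketch (the option keys are standard gRPC channel arguments, but the exact value TGI uses is an assumption here):

```python
# Hedged sketch: gRPC's default receive cap is 4 MiB, which is too small for
# multi-modal payloads. 2**31 - 1 is the int32 ceiling, i.e. just under 2 GiB.
MAX_RECEIVE_MESSAGE_LENGTH = 2**31 - 1

GRPC_OPTIONS = [
    ("grpc.max_receive_message_length", MAX_RECEIVE_MESSAGE_LENGTH),
    ("grpc.max_send_message_length", MAX_RECEIVE_MESSAGE_LENGTH),
]
# e.g. grpc.aio.insecure_channel(target, options=GRPC_OPTIONS)
```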
Ziru Niu
0f7d38e774
fix build.rs watch files ( #2072 )
2024-06-17 12:10:01 +02:00
Lysandre Debut
131838919e
Contributing guide & Code of Conduct ( #2074 )
...
* Contributing guide & Code of Conduct
* Redirect to GitHub's tutorial on PRs
2024-06-17 12:09:31 +02:00
Daniël de Kok
e903770897
Support different image sizes in prefill in VLMs ( #2065 )
...
When a batch contained images of different sizes during prefill, the
server would fail (see e.g. #2056 ). Images were processed separately and
then concatenated. However, this can fail for images with different sizes.
Fix this by preprocessing all images in the batch together, so that the
image processor can ensure that all image tensors have compatible sizes.
2024-06-17 10:49:41 +02:00
Alvaro Moran
445f313504
Adding architecture document ( #2044 )
...
* doc: adding architecture document
* doc: add architecture to toctree
* fix: avoid cargo lock changes
* fix: avoid cargo lock tweak
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-06-14 09:28:34 -04:00
Tiezhen WANG
96b7b40ca3
Update the link for qwen2 ( #2068 )
...
* Update the link for qwen2
* Fix Qwen2 model URL in model table
* Fix too eager staging
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-06-14 11:59:33 +02:00
Daniël de Kok
093a27c528
Add support for GPTQ Marlin ( #2052 )
...
Add support for GPTQ Marlin kernels
GPTQ Marlin extends the Marlin kernels to support common GPTQ
configurations:
- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false
Using the GPTQ Marlin kernels requires repacking the parameters in the
Marlin quantizer format.
The kernels were contributed by Neural Magic to vLLM. We vendor them
here for convenience.
2024-06-14 09:45:42 +02:00
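The supported configurations listed above imply a compatibility check along these lines (a sketch of the rule stated in the commit message, not the repository's actual code):

```python
# Hedged sketch: only these GPTQ configurations can be repacked into the
# Marlin quantizer format, per the supported-configuration list above.
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUP_SIZES = {-1, 32, 64, 128}

def gptq_marlin_compatible(bits, groupsize, desc_act):
    # desc_act may be either True or False; both are supported.
    return bits in SUPPORTED_BITS and groupsize in SUPPORTED_GROUP_SIZES
```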
drbh
f433f1f770
implement Open Inference Protocol endpoints ( #1942 )
...
* feat: add kserve feature and basic routes
* feat: implement infer endpoint wrapper around generate
* fix: refactor and improve types
* fix: improve infer and simplify
* fix: cleanup and improve api docs
* fix: refactor and encapsulate kserve feat in file
* fix: remove typos after rebase
2024-06-13 12:51:51 -04:00
drbh
42aa8ee1bb
PR #2049 CI run ( #2054 )
...
* Use minijinja's pycompat mode for python methods
* fix: cargo fmt lint for pre commit
---------
Co-authored-by: Armin Ronacher <armin.ronacher@active-4.com>
2024-06-13 11:53:49 -04:00
OlivierDehaene
90184df79c
fix(layers): fix SuRotaryEmbedding ( #2060 )
...
* fix(layers): fix SuRotaryEmbedding
* change arange
* remove logs
2024-06-12 18:24:47 +02:00
OlivierDehaene
521de6cacd
fix(server): fix OPT implementation ( #2061 )
2024-06-12 18:22:20 +02:00
drbh
376a0b7ada
Support chat response format ( #2046 )
...
* feat: support response_format in chat
* fix: adjust typos
* fix: add trufflehog lint
2024-06-11 10:44:56 -04:00
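A request using the new `response_format` field in chat might look like the following; the exact schema TGI accepts is an assumption here, modeled on the OpenAI-style chat API:

```python
# Hedged example request body (field shape is an assumption, not TGI's
# documented schema): ask the chat endpoint to constrain output to JSON.
payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "List three fruits as JSON."}],
    "response_format": {"type": "json_object"},
}
```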
fxmarty
a6e4d63c86
Update LLMM1 bound ( #2050 )
...
update commit
2024-06-11 19:30:29 +08:00
Luc Georges
dfca1dfc5e
fix(ci): remove unnecessary permissions ( #2045 )
2024-06-10 12:16:53 -04:00