Commit Graph

1003 Commits

Author SHA1 Message Date
drbh f15e808d4c
fix: reject grammars without properties (#2309) 2024-07-29 10:07:25 -04:00
Daniël de Kok 922732b255
Install Marlin from standalone package (#2320) 2024-07-29 15:37:10 +02:00
Erik Kaunismäki 583d37a2f8
Run ci api key (#2315)
* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.

* change name to info routes

* Fix comment

* convert strings to lowercase for case insensitive comparison

* convert header to string

* fixes and update docs

* update docs again

* revert wrong update

---------

Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
2024-07-29 11:14:17 +02:00
Adrien fd2e06316d
fix: fix buildkit config in ci
Signed-off-by: Adrien <adrien@huggingface.co>
2024-07-29 09:25:56 +02:00
drbh bab02ff2bc
feat: add ruff and resolve issue (#2262)
* feat: add ruff and resolve issue

* fix: update client exports and adjust after rebase

* fix: adjust syntax to avoid circular import

* fix: adjust client ruff settings

* fix: lint and refactor import check and avoid model enum as global names

* fix: improve fbgemm_gpu check and lints

* fix: update lints

* fix: prefer comparing model enum over str

* fix: adjust lints and ignore specific rules

* fix: avoid unneeded quantize check
2024-07-26 10:29:09 -04:00
Daniël de Kok 4b49c50f4c
Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313) 2024-07-26 14:57:24 +02:00
Adrien 3905f854ed
Fix registry name (#2307) 2024-07-25 16:06:00 +02:00
Nicolas Patry 17ed42be3a
Fixing idefics on g6 tests. (#2306) 2024-07-25 14:44:21 +02:00
Daniël de Kok 9256d7c38c
Some small fixes for the Torch 2.4.0 update (#2304)
* Fix GPTQ autotune data type to be compatible with Torch 2.4.0

* Update poetry lock file

* Fix small PaliGemma logprob differences after the torch update
2024-07-25 13:34:44 +02:00
Nicolas Patry 26614057a7
Using g6 instead of g5. (#2281)
* Using g6 instead of g5.

* Update the idefics2 snapshot.
2024-07-25 11:21:17 +02:00
drbh 5d85a958c9
fix: refactor adapter weight loading and mapping (#2193)
* fix: refactor adapter weight loading and mapping

* feat: enable lora load from directory

* fix: adjust launcher for local lora adapters

* feat: improve weight loading and add tests

* fix: improve logging and rebase syntax issue

* fix: improve adapter merge comments and remove unused conditional

* fix: improve get_model_with_lora_adapters naming

* fix: comment typo
2024-07-24 15:32:14 -04:00
Daniël de Kok 93d2b9fe9c
Split up `layers.marlin` into several files (#2292)
The marlin.py file was getting large, split it up.
2024-07-24 16:33:26 +02:00
Wang, Yi 8642250602
fix of use of unquantized weights in cohere GQA loading, also enable … (#2291)
fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-24 10:44:02 +02:00
Wang, Yi 5ad39dd3c3
fix crash in multi-modal (#2245)
* fix crash in multi-modal

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update according to review comment

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix llava_next regression in latest main

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-24 10:39:08 +02:00
OlivierDehaene a895029424
hotfix: update nccl 2024-07-23 23:31:28 +02:00
OlivierDehaene e7e3aa6cac
chore: update to torch 2.4 (#2259)
* chore: update to torch 2.4

* remove un-necessary patch

* fix
2024-07-23 20:39:43 +00:00
Daniël de Kok bc9593a5b1
hotfix: pin numpy (#2289) 2024-07-23 17:53:19 +02:00
Daniël de Kok 4ab4173767
Add support for Llama 3 rotary embeddings (#2286)
* Add support for Llama 3 rotary embeddings

* Update transformers to 4.43
2024-07-23 17:18:54 +02:00
Nicolas Patry 5d121a9705
Preparing for release. (#2285)
* Preparing for release.

* Updating docs.

* Fixing token within the docker image for the launcher.
2024-07-23 16:20:17 +02:00
shaltielshmid 3961e32390
[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254)
* Support passing head_dim through config

* Using `head_dim` as a fallback is necessary since it's a non-standard
key in `MistralConfig` (as defined in transformers).

* Shorter diff.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-07-23 15:00:07 +02:00
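The `head_dim` handling described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `resolve_head_dim` is a hypothetical helper, the config is assumed to be a plain dictionary, and the numbers in the usage lines are illustrative only.

```python
def resolve_head_dim(config: dict) -> int:
    """Return the attention head size for a model config.

    Hypothetical helper: prefer an explicit `head_dim` entry when the
    config provides one, otherwise derive it from the hidden size and
    the number of attention heads.
    """
    head_dim = config.get("head_dim")
    if head_dim is None:
        # Conventional derivation when the key is absent.
        head_dim = config["hidden_size"] // config["num_attention_heads"]
    return head_dim


# An explicit head_dim wins even when it differs from hidden_size / heads.
explicit = resolve_head_dim(
    {"hidden_size": 5120, "num_attention_heads": 32, "head_dim": 128}
)
# Without the key, the value is derived.
derived = resolve_head_dim({"hidden_size": 4096, "num_attention_heads": 32})
```

Reading the key with `dict.get` keeps configs that lack `head_dim` working unchanged.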
Daniël de Kok 9935720c87
Add support for repacking AWQ weights for GPTQ-Marlin (#2278)
* Add support for repacking AWQ weights for GPTQ-Marlin

So far we couldn't support AWQ because virtually all AWQ models use
asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
has recently added support for AWQ repacking and AWQ asymmetric quantization
(zero_point=True).

This change updates all GPTQ-Marlin kernels from upstream and wires up
AWQ support. For now enabling AWQ using Marlin requires running TGI with
`--quantize gptq`.

* Enable Marlin for supported AWQ configurations by default

This makes the AWQ -> GPTQ repack test redundant, since we are now
testing this with the regular AWQ test.
2024-07-23 13:08:20 +02:00
OlivierDehaene 5fca30ee15
fix(l4): fix fp8 logic on l4 (#2277)
* fix(l4): fix fp8 logic on l4

* also quant weights with single scale

* use marlin even on 89
2024-07-23 11:24:29 +02:00
Nicolas Patry abc32537ea
Fixing mistral nemo. (#2276) 2024-07-23 11:16:03 +02:00
Adrien 4700465192
use proper name for ci (#2274) 2024-07-22 21:50:53 +02:00
Nicolas Patry 6aeb669072
Softcapping for gemma2. (#2273)
* Softcapping for gemma2.

* Less clutter.

* No access to transformers config, only config_dict here.

* 0.0 is the null value in the C++ API.
2024-07-22 18:27:10 +02:00
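For readers unfamiliar with softcapping, the usual formulation squashes each value with `cap * tanh(x / cap)`. The sketch below is illustrative only: `softcap` is a hypothetical function name, and treating a cap of `0.0` as "disabled" is an assumption based on the note above that 0.0 is the null value in the C++ API.

```python
import math


def softcap(logits, cap):
    """Softly bound each logit to the range [-cap, cap].

    Assumption: a cap of 0.0 means "no softcapping", so the input is
    returned unchanged in that case.
    """
    if cap == 0.0:
        return list(logits)
    # tanh is close to the identity near zero and saturates at +/-1,
    # so small logits are barely changed and large ones approach +/-cap.
    return [cap * math.tanh(x / cap) for x in logits]


capped = softcap([-1000.0, 0.1, 1000.0], cap=30.0)
```

Small inputs pass through almost unchanged, while extreme values are limited to the cap.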
OlivierDehaene 4844ff790a
fix(server): fix fp8 weight loading (#2268)
* fix(server): fix fp8 weight loading

* fixed scales loading

* update snap

* revert default dtype
2024-07-22 15:51:32 +00:00
Adrien 6aebf44f47
fix(ci): test new instances (#2272)
* test new instances

Signed-off-by: Adrien <adrien@huggingface.co>

* improve build ci

Signed-off-by: Adrien <adrien@huggingface.co>

---------

Signed-off-by: Adrien <adrien@huggingface.co>
2024-07-22 14:41:30 +02:00
Erik Kaunismäki 07441f5a7a
legacy warning on text_generation client (#2271)
Update README.md

point to huggingface_hub inference clients instead
2024-07-22 12:00:17 +02:00
icyboy™ 4e4207224e
Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269)
* Update idefics_causal_lm.py

Fix syntax issues

* fix dbrx & opt model prefix bug

* Hotfix: fix of use of unquantized weights in Mixtral GQA loading
2024-07-22 11:31:00 +02:00
OlivierDehaene f3435bab8c
fix(server): fix deepseekv2 loading (#2266) 2024-07-21 18:48:04 +02:00
OlivierDehaene 53ec0b790b
feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248)
* feat(fp8): add support for fbgemm

* allow loading fp8 weights directly

* update outlines

* fix makefile

* build fbgemm

* avoid circular import and fix dockerfile

* add default dtype

* refactored weights loader

* fix auto conversion

* fix quantization config parsing

* force new nccl on install

* missing get_weights implementation

* increase timeout
2024-07-20 19:02:04 +02:00
Daniël de Kok e5c1d6d611
Add FP8 release test (#2261) 2024-07-20 10:26:06 +00:00
Adrien 11123a8e99
re-push to internal registry (#2242)
* re-push to internal registry

Signed-off-by: Adrien <adrien@huggingface.co>

* fix name

Signed-off-by: Adrien <adrien@huggingface.co>

* debug

Signed-off-by: Adrien <adrien@huggingface.co>

* debug

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip debug

Signed-off-by: Adrien <adrien@huggingface.co>

* add debug

Signed-off-by: Adrien <adrien@huggingface.co>

* should

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* ww

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* ww

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* wip

Signed-off-by: Adrien <adrien@huggingface.co>

* debug

Signed-off-by: Adrien <adrien@huggingface.co>

* w

Signed-off-by: Adrien <adrien@huggingface.co>

* revert tests

Signed-off-by: Adrien <adrien@huggingface.co>

* last reverts

Signed-off-by: Adrien <adrien@huggingface.co>

* another one

Signed-off-by: Adrien <adrien@huggingface.co>

---------

Signed-off-by: Adrien <adrien@huggingface.co>
2024-07-20 05:06:40 +00:00
Daniël de Kok e52be9bba2
Add support for Deepseek V2 (#2224)
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`),
  so we need weight loading that supports quantized weights. To this
  end, `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads of size 192 need an extension to our paged attention
  fork, and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.
2024-07-19 17:23:20 +02:00
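The "grouped top-K" expert selection listed in the commit above can be illustrated with a small pure-Python sketch. It assumes each group is scored by its best expert and that experts are laid out contiguously by group; the function and parameter names are hypothetical and do not come from the commit.

```python
def grouped_top_k(scores, n_groups, topk_groups, top_k):
    """Pick top_k expert indices, restricted to the best topk_groups groups.

    Assumptions: len(scores) is divisible by n_groups, experts of one
    group are contiguous, and a group's score is its maximum expert score.
    """
    group_size = len(scores) // n_groups
    # Score every group by its strongest expert.
    group_best = [
        max(scores[g * group_size:(g + 1) * group_size]) for g in range(n_groups)
    ]
    # Keep only the highest-scoring groups.
    kept = sorted(range(n_groups), key=lambda g: group_best[g], reverse=True)
    kept = kept[:topk_groups]
    # Candidate experts are those that live in a kept group.
    candidates = [
        i for g in kept for i in range(g * group_size, (g + 1) * group_size)
    ]
    # Ordinary top-k among the remaining candidates.
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]


router_scores = [0.9, 0.1, 0.2, 0.8, 0.7, 0.6, 0.05, 0.3]
selected = grouped_top_k(router_scores, n_groups=4, topk_groups=2, top_k=3)
```

In this example, expert 4 (score 0.7) is not selected even though it outscores expert 2 (score 0.2), because its group was not among the two best groups.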
drbh 68a9685f1b
fix: adjust default tool choice (#2244)
* fix: adjust default tool choice

* feat: improve tool choice syntax and response parsing/errors

* fix: remove dev tests

* feat: add ToolChoice to docs
2024-07-19 11:12:02 -04:00
Erik Kaunismäki 40f5dc3ed6
add usage stats to toctree (#2260)
quick fix
2024-07-19 16:34:04 +02:00
Erik Kaunismäki 4c19593a90
usage stats and crash reports (#2220)
* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of whole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if running in a container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
2024-07-19 16:17:56 +02:00
Daniël de Kok 3f37a66774
Hotfix: pass through model revision in `VlmCausalLM` (#2258) 2024-07-19 15:59:00 +02:00
Daniël de Kok 3b41e93a09
Hotfix: fix MPT after recent refactor (#2257) 2024-07-19 14:42:35 +02:00
Daniël de Kok 18db78f295
Hotfix: various GPT-based model fixes (#2256) 2024-07-19 14:42:19 +02:00
Daniël de Kok 80adb5be16
Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255) 2024-07-19 12:55:59 +02:00
Daniël de Kok ba291dad9f
Improve the handling of quantized weights (#2250)
* Improve the handling of quantized weights

Handling of quantized weights was split between two mechanisms:

- For quantized checkpoints, we used the new weight loader
  infrastructure.
- For quantization while loading (EETQ, FP8, bitsandbytes) we
  instead relied on conditionals in `get_linear`.

Weight loaders support context managers to selectively load
particular layers with different weight loaders, which is useful
for models like Idefics2 AWQ, which uses a quantized text model,
but unquantized vision and connector models. However, the context
manager would be overridden by `get_linear`, which string-checks
`quantizer`. Also, the context manager would not work with
EETQ, FP8, and bitsandbytes.

This change migrates all quantizers to the weight loader infrastructure.
This has several benefits:

- We can use context managers with all quantizers.
- All the implementation details move down to the quantizer layers,
  `get_linear` does not need to know how to handle quantizer linear
  layers.
- All quantizer weights are strongly typed, we don't pass around
  raw tensors.
- We don't have to pass around the `quantizer` string everywhere.

* Exclude non-MLP layers when using FP8 quantization with Llama
2024-07-19 09:37:39 +02:00
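The context-manager pattern described in the commit above (temporarily loading particular layers with a different weight loader) can be sketched as follows. This is a simplified illustration under stated assumptions: the `Weights` class, `use_loader`, and the callable loaders here are hypothetical stand-ins, not the project's actual API.

```python
from contextlib import contextmanager


class Weights:
    """Toy weight store that delegates loading to a swappable loader."""

    def __init__(self, default_loader):
        self.loader = default_loader

    @contextmanager
    def use_loader(self, loader):
        """Temporarily replace the active loader, restoring it on exit."""
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            # Restore the original loader even if loading raised.
            self.loader = previous

    def get_weights(self, prefix):
        # No string checks on a quantizer name: the active loader decides.
        return self.loader(prefix)


weights = Weights(lambda prefix: ("awq", prefix))
with weights.use_loader(lambda prefix: ("unquantized", prefix)):
    vision = weights.get_weights("vision_model")
text = weights.get_weights("text_model")
```

Because the loader itself decides how a layer is built, the calling code needs no knowledge of which quantizer is in use.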
OlivierDehaene 1d1b1efa01
fix(server): fix cohere (#2249) 2024-07-18 16:00:13 +02:00
Daniël de Kok da82c63a4f
Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237)
Fixes #2236.
2024-07-16 09:30:57 +02:00
Daniël de Kok 2cb1842852
`server quantize`: expose groupsize option (#2225) 2024-07-16 08:36:05 +02:00
Daniël de Kok 06d0e880e0
Add support for AWQ-quantized Idefics2 (#2233)
Fixes #2036.
2024-07-16 07:58:25 +02:00
Hugo Larcher 0ad7f6f87d
fix: Remove bitsandbytes installation when running cpu-only install (#2216)
Remove bitsandbytes installation when running cpu-only install
2024-07-15 15:34:20 +02:00
Erik Kaunismäki 457fb0a188
fix custom cache dir (#2226)
* fix to not ignore HUGGINGFACE_HUB_CACHE in cache

* delete printlns

* delete newlines

* maybe fix trailing whitespace
2024-07-15 15:17:13 +02:00
drbh 5a65066922
feat: simple mistral lora integration tests (#2180)
* feat: simple mistral lora integration tests

* fix: include args in docker launcher

* fix: disable cuda graphs with lora and warn

* fix: adjust docs and precommit issues

* fix: re update docs
2024-07-15 09:16:15 -04:00
Daniël de Kok dbb23fbfa8
Use symmetric quantization in the `quantize` subcommand (#2120)
Packing of asymmetric quantization is broken: all (q)zeros values
of `0` get reset to `1`, resulting in a loss of accuracy. So instead
use symmetric quantization. To be able to distinguish models with
symmetric and asymmetric quantization, a new config tensor `gptq_sym` is
added. If this tensor is not present, we assume `sym=False`.
2024-07-12 12:20:12 +02:00
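To see why a zero point of `0` being read back as `1` loses accuracy, consider a toy 4-bit quantizer. This is a simplified sketch, not the packing code from the commit: the helper names are hypothetical, and fixing the symmetric zero point at the midpoint of the range is an assumption about the scheme.

```python
def quantize(x, scale, zero, bits=4):
    """Map a float to an unsigned integer code with the given zero point."""
    q = round(x / scale) + zero
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, scale, zero):
    """Recover the approximate float value from an integer code."""
    return (q - zero) * scale


scale = 0.5

# Asymmetric case: the zero point is stored per group and may be 0.
q_asym = quantize(3.0, scale, zero=0)
exact = dequantize(q_asym, scale, zero=0)
# If the stored 0 comes back as 1, every value shifts by one step.
corrupted = dequantize(q_asym, scale, zero=1)

# Symmetric case: the zero point is assumed fixed at the midpoint (8 for
# 4 bits), so it is never 0 and cannot be hit by that reset.
sym_zero = 2 ** (4 - 1)
q_sym = quantize(3.0, scale, zero=sym_zero)
recovered = dequantize(q_sym, scale, zero=sym_zero)
```

Here the corrupted zero point shifts the recovered value by exactly one quantization step (`scale`), while the symmetric round trip is unaffected.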