hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
drbh	f15e808d4c	fix: reject grammars without properties (#2309 )	2024-07-29 10:07:25 -04:00
Daniël de Kok	922732b255	Install Marlin from standalone package (#2320 )	2024-07-29 15:37:10 +02:00
Erik Kaunismäki	583d37a2f8	Run ci api key (#2315 ) * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints. * change name to info routes * Fix comment * convert strings to lowercase for case insensitive comparison * convert header to string * fixes and update docs * update docs again * revert wrong update --------- Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>	2024-07-29 11:14:17 +02:00
Adrien	fd2e06316d	fix: fix buildkit config in ci Signed-off-by: Adrien <adrien@huggingface.co>	2024-07-29 09:25:56 +02:00
drbh	bab02ff2bc	feat: add ruff and resolve issue (#2262 ) * feat: add ruff and resolve issue * fix: update client exports and adjust after rebase * fix: adjust syntax to avoid circular import * fix: adjust client ruff settings * fix: lint and refactor import check and avoid model enum as global names * fix: improve fbgemm_gpu check and lints * fix: update lints * fix: prefer comparing model enum over str * fix: adjust lints and ignore specific rules * fix: avoid unneeded quantize check	2024-07-26 10:29:09 -04:00
Daniël de Kok	4b49c50f4c	Support tied embeddings in 0.5B and 1.5B Qwen2 models (#2313 )	2024-07-26 14:57:24 +02:00
Adrien	3905f854ed	Fix registry name (#2307 )	2024-07-25 16:06:00 +02:00
Nicolas Patry	17ed42be3a	Fixing idefics on g6 tests. (#2306 )	2024-07-25 14:44:21 +02:00
Daniël de Kok	9256d7c38c	Some small fixes for the Torch 2.4.0 update (#2304 ) * Fix GPTQ autotune data type to be compatible with Torch 2.4.0 * Update poetry lock file * Fix small PaliGemma logprob differences after the torch update	2024-07-25 13:34:44 +02:00
Nicolas Patry	26614057a7	Using g6 instead of g5. (#2281 ) * Using g6 instead of g5. * Update the idefics2 snapshot.	2024-07-25 11:21:17 +02:00
drbh	5d85a958c9	fix: refactor adapter weight loading and mapping (#2193 ) * fix: refactor adapter weight loading and mapping * feat: enable lora load from directory * fix: adjust launcher for local lora adapters * feat: improve weight loading and add tests * fix: improve logging and rebase syntax issue * fix: improve adapter merge comments and remove unused conditional * fix: improve get_model_with_lora_adapters naming * fix: comment typo	2024-07-24 15:32:14 -04:00
Daniël de Kok	93d2b9fe9c	Split up `layers.marlin` into several files (#2292 ) The marlin.py file was getting large, split it up.	2024-07-24 16:33:26 +02:00
Wang, Yi	8642250602	fix of use of unquantized weights in cohere GQA loading, also enable … (#2291 ) fix of use of unquantized weights in cohere GQA loading, also enable the model in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:44:02 +02:00
Wang, Yi	5ad39dd3c3	fix crash in multi-modal (#2245 ) * fix crash in multi-modal Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update according to review comment Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix llava_next regression in latest main Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-24 10:39:08 +02:00
OlivierDehaene	a895029424	hotfix: update nccl	2024-07-23 23:31:28 +02:00
OlivierDehaene	e7e3aa6cac	chore: update to torch 2.4 (#2259 ) * chore: update to torch 2.4 * remove un-necessary patch * fix	2024-07-23 20:39:43 +00:00
Daniël de Kok	bc9593a5b1	hotfix: pin numpy (#2289 )	2024-07-23 17:53:19 +02:00
Daniël de Kok	4ab4173767	Add support for Llama 3 rotary embeddings (#2286 ) * Add support for Llama 3 rotary embeddings * Update transformers to 4.43	2024-07-23 17:18:54 +02:00
Nicolas Patry	5d121a9705	Preparing for release. (#2285 ) * Preparing for release. * Updating docs. * Fixing token within the docker image for the launcher.	2024-07-23 16:20:17 +02:00
shaltielshmid	3961e32390	[WIP] Add support for Mistral-Nemo by supporting head_dim through config (#2254 ) * Support passing head_dim through config * Using `head_dim` as a fallback is necessary since it's a non standard key in mistralConfig (as defined in transformers). * Shorter diff. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-23 15:00:07 +02:00
Daniël de Kok	9935720c87	Add support for repacking AWQ weights for GPTQ-Marlin (#2278 ) * Add support for repacking AWQ weights for GPTQ-Marlin So far we couldn't support AWQ because virtually all AWQ models use asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin has recently added support for AWQ repacking and AWQ asymmetric quantization (zero_point=True). This change updates all GPTQ-Marlin kernels from upstream and wires up AWQ support. For now, enabling AWQ using Marlin requires running TGI with `--quantize gptq`. * Enable Marlin for supported AWQ configurations by default This makes the AWQ -> GPTQ repack test redundant, since we are now testing this with the regular AWQ test.	2024-07-23 13:08:20 +02:00
OlivierDehaene	5fca30ee15	fix(l4): fix fp8 logic on l4 (#2277 ) * fix(l4): fix fp8 logic on l4 * also quant weights with single scale * use marlin even on 89	2024-07-23 11:24:29 +02:00
Nicolas Patry	abc32537ea	Fixing mistral nemo. (#2276 )	2024-07-23 11:16:03 +02:00
Adrien	4700465192	use proper name for ci (#2274 )	2024-07-22 21:50:53 +02:00
Nicolas Patry	6aeb669072	Softcapping for gemma2. (#2273 ) * Softcapping for gemma2. * Less clutter. * No access to transformers config, only config_dict here. * 0.0 is the null value in the C++ API.	2024-07-22 18:27:10 +02:00
OlivierDehaene	4844ff790a	fix(server): fix fp8 weight loading (#2268 ) * fix(server): fix fp8 weight loading * fixed scales loading * update snap * revert default dtype	2024-07-22 15:51:32 +00:00
Adrien	6aebf44f47	fix(ci): test new instances (#2272 ) * test new instances Signed-off-by: Adrien <adrien@huggingface.co> * improve build ci Signed-off-by: Adrien <adrien@huggingface.co> --------- Signed-off-by: Adrien <adrien@huggingface.co>	2024-07-22 14:41:30 +02:00
Erik Kaunismäki	07441f5a7a	legacy warning on text_generation client (#2271 ) Update README.md point to huggingface_hub inference clients instead	2024-07-22 12:00:17 +02:00
icyboy™	4e4207224e	Hotfix: fix of use of unquantized weights in Mixtral GQA loading (#2269 ) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug * Hotfix: fix of use of unquantized weights in Mixtral GQA loading	2024-07-22 11:31:00 +02:00
OlivierDehaene	f3435bab8c	fix(server): fix deepseekv2 loading (#2266 )	2024-07-21 18:48:04 +02:00
OlivierDehaene	53ec0b790b	feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248 ) * feat(fp8): add support for fbgemm * allow loading fp8 weights directly * update outlines * fix makefile * build fbgemm * avoid circular import and fix dockerfile * add default dtype * refactored weights loader * fix auto conversion * fix quantization config parsing * force new nccl on install * missing get_weights implementation * increase timeout	2024-07-20 19:02:04 +02:00
Daniël de Kok	e5c1d6d611	Add FP8 release test (#2261 )	2024-07-20 10:26:06 +00:00
Adrien	11123a8e99	re-push to internal registry (#2242 ) * re-push to internal registry Signed-off-by: Adrien <adrien@huggingface.co> * fix name Signed-off-by: Adrien <adrien@huggingface.co> * debug Signed-off-by: Adrien <adrien@huggingface.co> * debug Signed-off-by: Adrien <adrien@huggingface.co> * wip Signed-off-by: Adrien <adrien@huggingface.co> * wip Signed-off-by: Adrien <adrien@huggingface.co> * wip debug Signed-off-by: Adrien <adrien@huggingface.co> * add debug Signed-off-by: Adrien <adrien@huggingface.co> * should Signed-off-by: Adrien <adrien@huggingface.co> * wip Signed-off-by: Adrien <adrien@huggingface.co> * ww Signed-off-by: Adrien <adrien@huggingface.co> * wip Signed-off-by: Adrien <adrien@huggingface.co> * wip Signed-off-by: Adrien <adrien@huggingface.co> * ww Signed-off-by: Adrien <adrien@huggingface.co> * wip Signed-off-by: Adrien <adrien@huggingface.co> * wip Signed-off-by: Adrien <adrien@huggingface.co> * debug Signed-off-by: Adrien <adrien@huggingface.co> * w Signed-off-by: Adrien <adrien@huggingface.co> * revert tests Signed-off-by: Adrien <adrien@huggingface.co> * last reverts Signed-off-by: Adrien <adrien@huggingface.co> * another one Signed-off-by: Adrien <adrien@huggingface.co> --------- Signed-off-by: Adrien <adrien@huggingface.co>	2024-07-20 05:06:40 +00:00
Daniël de Kok	e52be9bba2	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loads that support quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads with size 192 need an extension to our paged attention fork, and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-07-19 17:23:20 +02:00
drbh	68a9685f1b	fix: adjust default tool choice (#2244 ) * fix: adjust default tool choice * feat: improve tool choice syntax and response parsing/errors * fix: remove dev tests * feat: add ToolChoice to docs	2024-07-19 11:12:02 -04:00
Erik Kaunismäki	40f5dc3ed6	add usage stats to toctree (#2260 ) quick fix	2024-07-19 16:34:04 +02:00
Erik Kaunismäki	4c19593a90	usage stats and crash reports (#2220 ) * draft of usage stats * fix wrong link * launcher doesn't need sysinfo dep * only tokenizer class instead of whole struct * unused import * fix clippy errors * update openAPI doc * cargo fmt * fix error in passing flags to router * try again to update docs * run pre-commit locally * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * on crash use anonymous error event * delete json_output and ngrok * more robust way of checking if is in container * more robust nvidia smi * parse xpu more robustly * fix errors * add nvidia-smi details in docs * cargo fmt * fix clippy * should make docs check pass * Update router/src/usage_stats.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * error reason can't be in nested json * cargo fmt --------- Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>	2024-07-19 16:17:56 +02:00
Daniël de Kok	3f37a66774	Hotfix: pass through model revision in `VlmCausalLM` (#2258 )	2024-07-19 15:59:00 +02:00
Daniël de Kok	3b41e93a09	Hotfix: fix MPT after recent refactor (#2257 )	2024-07-19 14:42:35 +02:00
Daniël de Kok	18db78f295	Hotfix: various GPT-based model fixes (#2256 )	2024-07-19 14:42:19 +02:00
Daniël de Kok	80adb5be16	Hotfix: fix of use of unquantized weights in Gemma GQA loading (#2255 )	2024-07-19 12:55:59 +02:00
Daniël de Kok	ba291dad9f	Improve the handling of quantized weights (#2250 ) * Improve the handling of quantized weights Handling of quantized weights was split between two mechanisms: - For quantized checkpoints, we used the new weight loader infrastructure. - For quantization while loading (EETQ, FP8, bitsandbytes) we instead relied on a conditional in `get_linear`. Weight loaders support context managers to selectively load particular layers with different weight loaders, which is useful for models like Idefics2 AWQ, which uses a quantized text model, but unquantized vision and connector models. However, the context manager would be overridden by `get_linear`, which string-checks `quantizer`. Also, the context manager would not work with EETQ, FP8, and bitsandbytes. This change migrates all quantizers to the weight loader infrastructure. This has several benefits: - We can use context managers with all quantizers. - All the implementation details move down to the quantizer layers, `get_linear` does not need to know how to handle quantizer linear layers. - All quantizer weights are strongly typed, we don't pass around raw tensors. - We don't have to pass around the `quantizer` string everywhere. * Exclude non-MLP layers when using FP8 quantization with Llama	2024-07-19 09:37:39 +02:00
OlivierDehaene	1d1b1efa01	fix(server): fix cohere (#2249 )	2024-07-18 16:00:13 +02:00
Daniël de Kok	da82c63a4f	Remove stray `quantize` argument in `get_weights_col_packed_qkv` (#2237 ) Fixes #2236.	2024-07-16 09:30:57 +02:00
Daniël de Kok	2cb1842852	`server quantize`: expose groupsize option (#2225 )	2024-07-16 08:36:05 +02:00
Daniël de Kok	06d0e880e0	Add support for AWQ-quantized Idefics2 (#2233 ) Fixes #2036.	2024-07-16 07:58:25 +02:00
Hugo Larcher	0ad7f6f87d	fix: Remove bitsandbytes installation when running cpu-only install (#2216 ) Remove bitsandbytes installation when running cpu-only install	2024-07-15 15:34:20 +02:00
Erik Kaunismäki	457fb0a188	fix custom cache dir (#2226 ) * fix to not ignore HUGGINGFACE_HUB_CACHE in cache * delete printlns * delete newlines * maybe fix trailing whitespace	2024-07-15 15:17:13 +02:00
drbh	5a65066922	feat: simple mistral lora integration tests (#2180 ) * feat: simple mistral lora integration tests * fix: include args in docker launcher * fix: disable cuda graphs with lora and warn * fix: adjust docs and precommit issues * fix: re update docs	2024-07-15 09:16:15 -04:00
Daniël de Kok	dbb23fbfa8	Use symmetric quantization in the `quantize` subcommand (#2120 ) Packing of asymmetric quantization is broken, all (q)zeros values of `0` get reset to `1`, resulting in a loss of accuracy. So instead use symmetric quantization. To be able to distinguish models with symmetric and asymmetric quantization, a new config tensor `gptq_sym` is added. If this tensor is not present, we assume `sym=False`.	2024-07-12 12:20:12 +02:00


1003 Commits