hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
fxmarty	4e3f687427	use base docker image	2024-07-08 13:10:09 +02:00
fxmarty	8c590be463	Merge branch 'main' into ci_amd3	2024-07-08 13:06:39 +02:00
Daniël de Kok	153fcf7739	Fix incorrect cache allocation with multi-query (#2203) We wouldn't allocate any memory in multi-query (1 KV head). Fixes Starcoder et al.	2024-07-08 11:19:48 +02:00
Daniël de Kok	cce475a949	hotfix: Fix number of KV heads (#2202) Fix number of KV heads	2024-07-08 09:52:12 +02:00
icyboy™	521d0d990f	fix dbrx & opt model prefix bug (#2201) * Update idefics_causal_lm.py Fix syntax issues * fix dbrx & opt model prefix bug	2024-07-08 09:01:14 +02:00
Daniël de Kok	05c094fcfa	Consistently take `prefix` in model constructors (#2191) * Consistently take `prefix` in model constructors * Release test check fix * Misc refactor-related fixes	2024-07-05 16:07:48 +02:00
Daniël de Kok	67ef0649cf	GPTQ CI improvements (#2151) * Add more representative Llama GPTQ test The Llama GPTQ test is updated to use a model with the commonly-used quantizer config format and activation sorting. The old test is kept around (but renamed) since it tests the format produced by `text-generation-server quantize`. * Add support for manually triggering a release build	2024-07-05 14:12:16 +02:00
Daniël de Kok	b67d46336e	Fix Starcoder2 after refactor (#2189)	2024-07-05 12:22:45 +02:00
Nicolas Patry	853d4eb9cf	Hotfixing after refactor.	2024-07-05 09:25:29 +00:00
Nicolas Patry	fb2f74e2b9	Refactor dead code - Removing all `flash_xxx.py` files. (#2166) * Refactor dead code. * First working step. * Remove a lot of duplicated code. * More dead code. * More cleanup. * Fix Santacoder test. * Fixing the simple tests. * Fixing sharding. * Fixes for VLM. * Fixing santacoder (num_kv_heads hardcoded). * Removing more dead code. * Fixing `config.n_head`. * Stopping earlier because of `<end_of_utterance>` in idefics2. * Addresses comments. * Removing the dead code. * Fuse back mistral into FlashCausalLM. * Finish removal. * Fixing docs + causal_lm `batch_class`. * Fixing docs + causal.lm. * Add default to Gemma Causality. * Default value for gemma/gemma2. * Wrong default.	2024-07-05 10:29:56 +02:00
Aaron Mihalik	c6bcadf883	Adding "longrope" for Phi-3 (#2172) (#2179) Adding "longrope" for phi-3	2024-07-05 09:46:41 +02:00
Nicolas Patry	245d3de948	Preparing patch release. (#2186)	2024-07-04 10:55:33 +02:00
Nicolas Patry	5ad41aa2a6	Fixing missing `object` field for regular completions. (#2175) * Fixing missing `object` field for regular completions. * Fixing docs by re-adding missing `Prompt`.	2024-07-03 12:56:27 +02:00
Nicolas Patry	2b3bd1e008	Fixing the dockerfile warnings. (#2173)	2024-07-03 12:48:45 +02:00
Nicolas Patry	be4a4c47f9	Revert "Fixing missing `object` field for regular completions." This reverts commit `2bbb7fa4b2`.	2024-07-03 10:41:39 +00:00
Nicolas Patry	2bbb7fa4b2	Fixing missing `object` field for regular completions.	2024-07-03 10:40:22 +00:00
drbh	571530dd9a	feat: improve update_docs for openapi schema (#2169) * feat: add pre commit step to force schema update when router changes * fix: prefer improved update_doc and start server and compare * fix: adjust typo * fix: adjust revert typo * fix: update workflow to use update_doc md command * feat: improve workflow to check openapi schema too * fix: adjust timeout for CI * fix: adjust raise condition and install server in ci * fix: install protoc before server * feat: improve update doc and add command to print router schema * fix: adjust autodoc workflow * fix: explicitly install protoc and python * fix: allow trailing space in openapi schema diff	2024-07-03 09:53:35 +02:00
fxmarty	29a416078c	Merge branch 'main' into ci_amd3	2024-07-02 15:32:53 +02:00
Felix Marty	add4d42cb3	do not use tunableop for non flash-causal-lm models	2024-07-02 12:52:55 +00:00
Nicolas Patry	0759ec495e	Hotfixing qwen2 and starcoder2 (which also get clamping). (#2167)	2024-07-02 14:26:47 +02:00
Guillaume LEGENDRE	963b6c6f0f	Ci test (#2124) * first test with registry mirror * change push registry * remove comments * Move cache to push registry * fix registry url * Update .github/workflows/ci_build.yaml --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-02 12:45:38 +02:00
Nicolas Patry	dea9c0dc74	Fixing rocm. (#2164)	2024-07-02 12:01:08 +02:00
drbh	b966bc0d35	fix: use the base layers weight in mistral rocm (#2155)	2024-07-02 11:56:25 +02:00
Wang, Yi	5d97e0c4a3	fix FlashDecoding change's regression in intel platform (#2161) install triton because GPTQParams needs it. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-02 11:56:07 +02:00
Nicolas Patry	022f6515a4	Fixing graph capture for flash decoding. (#2163)	2024-07-02 11:43:07 +02:00
Felix Marty	c2f4b7f93e	add vicuna	2024-07-02 08:25:12 +00:00
Nicolas Patry	4327210e6b	[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) * Using flash decoding Conditional flashdecoding. Fix max_q. Working kvcache Working version with flash decoding. Make it work for mistral. Fix after rebase.. Less intrusive. Revert changes in modeling. Speedup flashdecoding. HHachweew Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs Not sending preallocated output. * Making it work on non flash decoding. * Fix Cohere. * Fix non decoding paths. * Rebased. * No need for cache_manager anymore. * Update? * "ipex" -> "cpu" * These do not belong. * Factoring cu_seqlen_qk for better abstracting over every model. * Fixing non flash tests/imports. * Changing return everywhere. * Update mistral past. * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though). * Fixup mistral clamping (had issues with cuda graphs). * No need to recreate anything actually.	2024-07-01 23:28:00 +02:00
Nicolas Patry	4f55f15840	Fixing baichuan override. (#2158)	2024-07-01 23:25:54 +02:00
Nicolas Patry	d0225b1015	GH router. (#2153)	2024-07-01 15:42:26 +02:00
Nicolas Patry	17cebc4506	Fixing test. (#2152)	2024-07-01 15:24:17 +02:00
drbh	9eefb2f672	fix: prefer serde structs over custom functions (#2127) * fix: prefer enum for chat object * fix: adjust typo * fix: enum CompletionType not ObjectType * fix: adjust typo * feat: leverage serde for conditional deser * fix: adjust HubTokenizerConfig after rebase * fix: update create_post_processor logic for token type * fix: adjust unwrap syntax in template * Fixing the post processor. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-01 15:08:05 +02:00
Wang, Yi	5da4cfab1c	refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132) * refine get xpu free memory Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable qwen2 in xpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * enable gemma/gemma2/phi in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-01 14:32:54 +02:00
Felix Marty	e0bfe4e7f0	fix	2024-07-01 12:31:56 +00:00
Felix Marty	750ef7bc23	Merge branch 'ci_amd3' of github.com:huggingface/text-generation-inference into ci_amd3	2024-07-01 12:20:40 +00:00
Felix Marty	00cc73b7b7	fix post merge	2024-07-01 12:20:29 +00:00
icyboy™	9d0ca503a8	fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' (#2123) https://github.com/huggingface/text-generation-inference/issues/2122	2024-07-01 14:17:22 +02:00
fxmarty	59849777de	Merge branch 'main' into ci_amd3	2024-07-01 14:14:46 +02:00
Felix Marty	9fd395fae4	fix tests	2024-07-01 12:12:26 +00:00
Daniël de Kok	2ce8019480	Use GPTQ-Marlin for supported GPTQ configurations (#2111) GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So let's use it by default if the kernels are installed, the GPU supports it, and the kernels support the configuration. For models generated by `text-generation-server quantize`, use `sym=False`. This subcommand has used asymmetric quantization since the beginning, and incorrectly reporting the model as symmetric would select GPTQ-Marlin (which does not support asymmetric quantization).	2024-07-01 12:59:12 +02:00
drbh	0d97a93c1e	feat: download lora adapter weights from launcher (#2140)	2024-07-01 12:58:49 +02:00
drbh	25f57e2e98	fix: use weights from base_layer (#2141)	2024-07-01 12:58:40 +02:00
Nicolas Patry	b4552f9de9	Fixing clippy. (#2149)	2024-07-01 12:02:19 +02:00
Wang, Yi	6ea570ddfe	fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148) * fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices] Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-01 11:27:53 +02:00
Felix Marty	05d1011b4f	fix xpu build	2024-06-28 16:08:27 +00:00
Felix Marty	68583d3240	working memory leak fix in tunableop	2024-06-28 15:15:12 +00:00
Felix Marty	3d50ff71b7	bump torch to more recent version	2024-06-28 13:10:43 +00:00
Felix Marty	87db820627	fix rm	2024-06-28 09:49:20 +00:00
Nicolas Patry	fb98ab273f	Fixing the CI to also run in release when it's a tag ? (#2138)	2024-06-28 09:31:09 +02:00
drbh	74b0231b19	fix: refactor post_processor logic and add test (#2137) * fix: refactor post_processor logic and add test * fix: remove dev comment * fix: adjust when post_processor is overridden and improve create_post_processor	2024-06-27 23:16:19 +02:00
Felix Marty	eaa6890b3c	remove hidden	2024-06-27 15:24:14 +00:00


887 Commits