hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Morgan Funtowicz	31d9254776	feat(backend): remove static from inner_fw visitor as it leads to invalid memory locations	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	7b0a56f40f	feat(backend): fix memory leaking on llama_sampler when the decode ends	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	86a2ae6ba2	chore: unsued variables	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	2cdfed94d9	feat(backend): correctly link to shared fmt and spdlog instead of static	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	bd8f0f15e1	feat(backend): fix invalid reference to ctx instead of context in release build	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	3e82f14f57	feat(backend): somewhat generates the final infer response	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	b50dcddbb8	feat(backend): avoid dropping the boxed stream at the end of the callback	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	612f2f939f	feat(backend): bind incoming request to the server	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	d4aee42fd8	feat(backend): add logit parameter in the callback fn	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	f39edc72ff	feat(backend): add mapping for ignore_eos_token stopping criteria	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	3af2c6837c	misc(offline): match rework	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	d52b4c4978	feat(backend): full rework of the backend internal to safer c++	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	6a5f6b0755	misc(offline): update offline tester	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	b98c635781	feat(backend): entirely rewrite backend	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	611590440d	misc(offline): expose more parameters for generate	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	dbc5b7a0f7	misc(offline): link correctly	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	0c1dd0ed2b	feat(llamacpp): wip explosion	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	a316c53255	feat(llamacpp): expose number of threads for the backend when constructing the model	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	179309b364	misc(build): refactor build type detection in cmake	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	f0859c247f	misc(build): handle different lib destination folder lib/lib64	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	e4d803c94e	feat(backend): build and link through build.rs	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	355d8a55b4	feat(backend): wip Rust binding	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	f9c248657d	chore(backend): minor formatting	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	37faeb34b2	feat(backend): expose frequency and repetition penalties	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	d4b5be10f9	feat(backend): minor refactor	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	92bb113653	feat(backend): use llama_token as TokenId type	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	45d5a6a8c5	feat(backend): add some initial decoding steps	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	098c66920d	feat(backend): tell cmake to build llama-common and link to it	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	0911076320	feat(backend): correctly load llama.cpp model from llama api and not gpt2	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	05ad684676	feat(llamacpp): enable cuda	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	fa89d1e613	misc(cmake): wut	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	e4432d36b1	misc(cmake): add parameter to build specific cuda arch	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	52d57dca79	feat(llamacpp): initial end2end build	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	7d1f8a2bd6	feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	aa1fcba59f	feat(llamacpp): initial commit # Conflicts: # Cargo.lock	2024-11-14 08:42:01 +01:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
Wang, Yi	97f7a22f0b	add trust_remote_code in tokenizer to fix baichuan issue (#2725) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-07 14:43:38 +01:00
Wang, Yi	b1f9044d6c	fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717) fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ ipex kernel provide func like add_bias, so no need add it outside Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-04 16:07:51 +01:00
Daniël de Kok	5eedb2ec7a	nix: move to tgi-nix `main` (#2718)	2024-11-04 15:40:13 +01:00
Nicolas Patry	9fde566602	Fixing linting on main. (#2719)	2024-11-04 15:21:41 +01:00
Travis Addair	aadc9cb485	Fix prefix caching + speculative decoding (#2711)	2024-11-04 15:08:43 +01:00
Nicolas Patry	a5593ba83e	Hotfixing auto length (warmup max_s was wrong). (#2716)	2024-11-04 09:55:54 +01:00
drbh	08c4184eb2	fix: add chat_tokenize endpoint to api docs (#2710)	2024-11-04 06:44:59 +01:00
drbh	6e3220529d	fix: create position ids for text only input (#2714) * fix: create position ids for text only input * fix: prefer repeat over expand to avoid clone	2024-11-02 08:40:05 +08:00
drbh	01dacf8e8f	fix cuda graphs for qwen2-vl (#2708) * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl * fix: only check model type if config exists * fix: adjust sharding and lm head logic * fix qwen2 failure in intel cpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: return correct shape logits and add streaming test * fix: remove unused import and refactor test --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-01 03:05:34 +01:00
drbh	befd9f6735	Support qwen2 vl (#2689) * feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with meesage and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2	2024-10-30 12:40:51 -04:00
Wang, Yi	46aeb0860d	add xpu triton in dockerfile, or will show "Could not import Flash At… (#2702) add xpu triton in dockerfile, or will show "Could not import Flash Attention enabled models: No module named 'triton'" Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-30 14:18:50 +01:00
Nicolas Patry	98330df65e	Monkey patching as a desperate measure. (#2704) * Monkey patching as a desperate measure. * New snapshot ?	2024-10-28 11:25:13 +01:00
Nicolas Patry	513d19b955	More timeout on docker start ? (#2701) * More timeout on docker start ? * Latest upgrade.	2024-10-28 08:57:22 +01:00
Nicolas Patry	3a9cdc3241	Fixing auto bloom test. (#2699)	2024-10-28 06:14:11 +01:00

1205 Commits, All Branches