hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Morgan Funtowicz	45d5a6a8c5	feat(backend): add some initial decoding steps	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	098c66920d	feat(backend): tell cmake to build llama-common and link to it	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	0911076320	feat(backend): correctly load llama.cpp model from llama api and not gpt2	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	05ad684676	feat(llamacpp): enable cuda	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	fa89d1e613	misc(cmake): wut	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	e4432d36b1	misc(cmake): add parameter to build specific cuda arch	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	52d57dca79	feat(llamacpp): initial end2end build	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	7d1f8a2bd6	feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros	2024-11-14 08:42:01 +01:00
Morgan Funtowicz	aa1fcba59f	feat(llamacpp): initial commit # Conflicts: # Cargo.lock	2024-11-14 08:42:01 +01:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732 ) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
Wang, Yi	97f7a22f0b	add trust_remote_code in tokenizer to fix baichuan issue (#2725 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-07 14:43:38 +01:00
Wang, Yi	b1f9044d6c	fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717 ) fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ ipex kernel provide func like add_bias, so no need add it outside Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-04 16:07:51 +01:00
Daniël de Kok	5eedb2ec7a	nix: move to tgi-nix `main` (#2718 )	2024-11-04 15:40:13 +01:00
Nicolas Patry	9fde566602	Fixing linting on main. (#2719 )	2024-11-04 15:21:41 +01:00
Travis Addair	aadc9cb485	Fix prefix caching + speculative decoding (#2711 )	2024-11-04 15:08:43 +01:00
Nicolas Patry	a5593ba83e	Hotfixing auto length (warmup max_s was wrong). (#2716 )	2024-11-04 09:55:54 +01:00
drbh	08c4184eb2	fix: add chat_tokenize endpoint to api docs (#2710 )	2024-11-04 06:44:59 +01:00
drbh	6e3220529d	fix: create position ids for text only input (#2714 ) * fix: create position ids for text only input * fix: prefer repeat over expand to avoid clone	2024-11-02 08:40:05 +08:00
drbh	01dacf8e8f	fix cuda graphs for qwen2-vl (#2708 ) * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl * fix: only check model type if config exists * fix: adjust sharding and lm head logic * fix qwen2 failure in intel cpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: return correct shape logits and add streaming test * fix: remove unused import and refactor test --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-01 03:05:34 +01:00
drbh	befd9f6735	Support qwen2 vl (#2689 ) * feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with message and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2	2024-10-30 12:40:51 -04:00
Wang, Yi	46aeb0860d	add xpu triton in dockerfile, or will show "Could not import Flash At… (#2702 ) add xpu triton in dockerfile, or will show "Could not import Flash Attention enabled models: No module named 'triton'" Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-30 14:18:50 +01:00
Nicolas Patry	98330df65e	Monkey patching as a desperate measure. (#2704 ) * Monkey patching as a desperate measure. * New snapshot ?	2024-10-28 11:25:13 +01:00
Nicolas Patry	513d19b955	More timeout on docker start ? (#2701 ) * More timeout on docker start ? * Latest upgrade.	2024-10-28 08:57:22 +01:00
Nicolas Patry	3a9cdc3241	Fixing auto bloom test. (#2699 )	2024-10-28 06:14:11 +01:00
Nicolas Patry	78ce618c70	Update poetry lock. (#2698 )	2024-10-28 06:11:33 +01:00
Nicolas Patry	90b226db29	We can have a tokenizer anywhere. (#2527 ) * We can have a tokenizer anywhere. * Handling potential lack of offsets (python tokenizer) * Remove redundancy. * Fixing the tests. * Flake.lock update ? * Fixing the GIL locking. * Fixing mamba by using the transformers version. * Adding the legacy handle. * Elide lifetime. * Lint. * Deprecation message. * Fixing bad rebase.	2024-10-28 05:00:24 +01:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673 ) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
Nicolas Patry	2e4f4ba1bb	Green main (#2697 )	2024-10-28 04:59:32 +01:00
Nicolas Patry	8a8794a672	Avoiding timeout for bloom tests. (#2693 ) * Avoiding timeout for bloom tests. * Skip the test let's see if it's always the first tests that fails. * Fail early. * Pulling ? * No early exit.	2024-10-26 05:35:28 +02:00
OlivierDehaene	a6b02da971	chore: prepare 2.4.0 release (#2695 )	2024-10-25 21:10:49 +00:00
OlivierDehaene	6f88bd9390	feat: add triton kernels to decrease latency of large batches (#2687 ) * feat: add triton kernels to decrease latency of large batches * cast to int32 * fix kernel * fix kernel * disable triton on rocm * fix speculation * add slots filtering kernel	2024-10-25 21:10:00 +00:00
Daniël de Kok	0f346a3296	Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688 ) * Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing. * Update test snapshots	2024-10-25 16:40:47 +02:00
Funtowicz Morgan	ba5fc7d922	Add support for stop words in TRTLLM (#2678 ) * feat(trtllm): rewrite health to not account for current state * chore(looper): cleanup a bit more * feat(post_processing): max_new_tokens is const evaluated now * chore(ffi):formatting * feat(trtllm): add stop words handling # Conflicts: # backends/trtllm/lib/backend.cpp * chore(trtllm): create specific parallelconfig factory and logging init methods * chore(trtllm): define a macro for SizeType cast * chore(trtllm): use GetParallelConfig * chore(trtllm): minor refactoring * chore(trtllm): validate there are enough GPus on the system for the desired model * chore(trtllm): ensure max throughput scheduling policy is selected * chore(trtllm): minor fix * chore(router): minor refactorings * feat(docker): build with-slurm ompi * feat(docker): add python3.10 dev to runtime deps * chore(docker): add mpi to ld_library_path * chore(docker): install transformers * feat(trtllm): detect stop_words from generation_config.json	2024-10-25 10:58:34 +02:00
Nicolas Patry	db68bd0524	Fixing mt0 test. (#2692 )	2024-10-25 09:46:39 +02:00
Nicolas Patry	cece8635f8	Fixing rocm gptq by using triton code too (renamed cuda into triton). (#2691 )	2024-10-25 09:17:57 +02:00
Funtowicz Morgan	43df056eee	[TENSORRT-LLM] - Implement new looper thread based backend (#2357 ) * (backend) use parking_lot crate for RwLock fairness # Conflicts: # backends/trtllm/src/backend.rs * (launcher) default new server::run parameters to false for now * (chore) fmt ... why? * (ffi) use const for GetSamplingConfig * (server) expose new SchedulingError * (trt) * (build) setup ccache if available * (ffi) add max_new_tokens parameters * (backend) cleanup a bit * (backend) expose PullNewTokens * (ffi) cleanup again * (ffi) add missing headers imports * (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException> * (looper) new looper initial implementation * (ffi) remove narrowing type warning * (ffi) encode the provided user prompt within each request thread * (misc) change scope identifiers * (backend) implement the post_processor background thread * (misc) missing Result types for Rust * use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step * (server) forward auth_token to server::run * (build) fetchcontent use archives instead of git * (ffi) fix usage of wrong vector constructor making a capacity fill call * (ffi) missing namespace for tle::Response * (ffi) do not use reference capture in lambda as we are not capturing anything * (backend) refactor & cleanup * (Dockerfile.trtllm) delete for now * (misc) simplify [make_]move_iterator by using c++20 type inference * (misc) no need to move for uint32_t items * (scheduler) rework submit/pull logic * (post) impl postprocessing * (misc) delete backend.rs * (misc) rerun-if-changed all the cmake modules * (misc) move to latest trtllm * (fix): HOPPER_SM_MAJOR is 9 not 8 * (misc): build for sm_{75,80,86,89,90} by default * (misc): build with trtllm 0.13.0 * (misc): increase verbosity of spdlog * (fix): do not recreate the stateful hashmap at every it * (misc): update dependency in trtllm dockerfile * (misc): update dependency in trtllm dockerfile * (misc): disable logging in release mode * (misc): improve trtllm download script robustness * (fix): more fixes for Dockerfile * misc(cuda): require 12.6 * chore(cmake): use correct policy for download_timestamp * feat(looper): check engine and executorWorker paths exist before creating the backend * chore(cmake): download timestamp should be before URL * feat(looper): minor optimizations to avoid growing too much the containers * chore(trtllm): move dockerfile to right place * chore(trtllm): disable tokenizer parallelism by default * chore(trtllm): fmt * chore(trtllm): post-rebase commit * chore(trtllm): remove unused method * feat(trtllm): cache maxNumTokens to avoid calling JSON everytime * misc(router): remove SchedulingError * feat(trtllm): do not tokenize twice * Revert "chore(trtllm): remove unused method" This reverts commit `31747163` * chore(rebase): fix invalid references * chore(router): add python dependency * Lint. * Fix bad rebase --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-10-25 07:17:14 +02:00
Nicolas Patry	ed87b464b4	Fixing "deadlock" when python prompts for trust_remote_code by always (#2664 ) specifying a value.	2024-10-25 06:39:21 +02:00
Daniël de Kok	eab07f746c	Add support for FP8 KV cache scales (#2628 ) * Add support for FP8 KV cache scales Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration data and stored in the checkpoint. This change adds support for using key-value scales and loading them from checkpoints in the two most common formats: - Separate per-layer `k_scale` and `v_scale` scalars. - Per-layer `kv_scale` scalar (older format). Currently, scales are only used with a `float8_e4m3fn` cache. Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy. * Update FP8 KV cache test to use checkpoint with scales * `can_scale`: check that the attention is flashinfer	2024-10-24 16:36:18 +02:00
Daniël de Kok	14a0df3a38	Fix Phi 3.5 MoE tests (#2684 ) PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix this.	2024-10-24 15:21:50 +02:00
Daniël de Kok	1b914f37e7	flashinfer: reminder to remove contiguous call in the future (#2685 )	2024-10-24 14:59:56 +02:00
OlivierDehaene	41c2623735	feat: allow any supported payload on /invocations (#2683 ) * feat: allow any supported payload on /invocations * update openAPI * update doc	2024-10-23 11:26:01 +00:00
OlivierDehaene	27ff1871b5	hotfix: fix flashllama	2024-10-23 13:22:31 +02:00
OlivierDehaene	03c9388bf7	feat: natively support Granite models (#2682 ) * feat: natively support Granite models * Update doc	2024-10-23 10:04:05 +00:00
Daniël de Kok	f58eb70ebf	Make moe-kernels and marlin-kernels mandatory in CUDA installs (#2632 )	2024-10-23 11:07:31 +02:00
Daniël de Kok	9c9ef37c56	Add `impureWithCuda` dev shell (#2677 ) * Add `impureWithCuda` dev shell This shell is handy when developing some kernels jointly with TGI - it adds nvcc and a bunch of commonly-used CUDA libraries to the environment. We don't add this to the normal impure shell to keep the development environment as clean as possible (avoid accidental dependencies, etc.). * Add cuDNN	2024-10-22 11:02:55 +02:00
Wang, Yi	058d3061f7	break when there's nothing to read (#2582 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-21 15:22:48 +02:00
Daniël de Kok	7f54b7336a	Test Marlin MoE with `desc_act=true` (#2622 ) Update the Mixtral GPTQ test to use a model with `desc_act=true` and `group_size!=-1` to ensure that we are checking activation sorting/non-full K (with tensor parallelism). The `desc_act=false` case is already checked by the Mixtral AWQ test.	2024-10-21 12:50:35 +02:00
Daniël de Kok	5e0fb46821	Make handling of FP8 scales more consistent (#2666 ) Change `fp8_quantize` so that we can pass around reciprocals everywhere, so scales are always passed around in the checkpoint format. I also noticed that we ignore any input scales that we might have when fbgemm is available. Skip this path if we already have a scale.	2024-10-19 09:05:01 +02:00
Nicolas Patry	153ff3740b	CI job. Gpt awq 4 (#2665 ) * add gptq and awq int4 support in intel platform Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix ci failure Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * set kv cache dtype Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * refine the code according to the review command Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Simplifying conditionals + reverting integration tests values. * Unused import * Fix redundant import. * Revert change after rebase. * Upgrading the tests (TP>1 fix changes to use different kernels.) * Update server/text_generation_server/layers/gptq/__init__.py --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-18 17:55:53 +02:00
Daniël de Kok	8ec57558cd	Break cycle between the attention implementations and KV cache (#2627 )	2024-10-17 14:54:22 +02:00

1129 Commits
