hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
David Holtz	0fd2ab3e89	fix: remove unused deps and imports	2024-11-18 21:48:09 +00:00
David Holtz	e428a14d19	fix: add protobuf update and mp4parse dep	2024-11-18 21:22:19 +00:00
David Holtz	a9c2d28a3a	feat: support video input chunks and enable qwen2 vl to process video	2024-11-18 21:16:21 +00:00
Miquel Farre	6b4697e9d1	fix	2024-11-18 13:03:09 -05:00
Miquel Farre	cee1dea803	refactoring	2024-11-18 13:01:50 -05:00
Miquel Farre	f7cf45dfde	fix	2024-11-18 13:01:50 -05:00
Miquel Farre	bd04258e2c	downloading videos	2024-11-18 13:01:50 -05:00
Miquel Farre	f9ee2500cf	fix	2024-11-18 13:01:50 -05:00
Miquel Farre	b4e096c080	connecting video to qwen2	2024-11-18 13:01:50 -05:00
Miquel Farre	da644c21e5	adopting video url	2024-11-18 13:01:50 -05:00
Miquel Farre	fc5b0ac1fd	router changes	2024-11-18 13:01:50 -05:00
Miquel Farre	de6c68443e	WIP video support	2024-11-18 13:01:50 -05:00
drbh	38cff84a3e	feat: support flash attention 2 in qwen2 vl vision blocks (#2721) * feat: support flash attention 2 in qwen2 vl vision blocks * fix: calc max_seqlen once and small refactors	2024-11-18 12:46:40 -05:00
Daniël de Kok	3c9df21ff8	Add support for compressed-tensors w8a8 int checkpoints (#2745) * Add support for compressed-tensors w8a8 int checkpoints This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5. Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8: \| Tasks \|Version\| Filter \|n-shot\| Metric \| \|Value \| \|Stderr\| \|---------------\|------:\|----------------\|-----:\|-----------------------\|---\|-----:\|---\|------\| \|gsm8k_cot_llama\| 3\|flexible-extract\| 8\|exact_match \|↑ \|0.8431\|± \|0.0100\| \| \| \|strict-match \| 8\|exact_match \|↑ \|0.8393\|± \|0.0101\| \|ifeval \| 4\|none \| 0\|inst_level_loose_acc \|↑ \|0.8597\|± \| N/A\| \| \| \|none \| 0\|inst_level_strict_acc \|↑ \|0.8201\|± \| N/A\| \| \| \|none \| 0\|prompt_level_loose_acc \|↑ \|0.7967\|± \|0.0173\| \| \| \|none \| 0\|prompt_level_strict_acc\|↑ \|0.7468\|± \|0.0187\| Which is the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels. * Always use dynamic input quantization for w8a8 int It's far less flaky and gives better output. * Use marlin-kernels 0.3.5 * Fix a typo Co-authored-by: drbh <david.richard.holtz@gmail.com> * Small fixes --------- Co-authored-by: drbh <david.richard.holtz@gmail.com>	2024-11-18 17:20:31 +01:00
Wang, Yi	a5ecd6e586	add ipex moe implementation to support Mixtral and PhiMoe (#2707) * add ipex moe implementation to support Mixtral and PhiMoe Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * update to ipex xpu 2.5 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * torch has xpu support in 2.5 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix oneapi basekit version Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review Co-authored-by: Daniël de Kok <me@github.danieldk.eu> --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Daniël de Kok <me@github.danieldk.eu>	2024-11-18 17:16:55 +01:00
drbh	fea62e928f	fix: improve find_segments via numpy diff (#2686)	2024-11-18 09:51:06 -05:00
Daniël de Kok	52e48739a5	Remove vLLM dependency for CUDA (#2751) * Remove vLLM dependency for CUDA This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA. Tested run (since we don't have paged attention in CI): ``` ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release [...] 5 snapshots passed. ``` * Fix clippy warning	2024-11-17 17:34:50 +01:00
drbh	6489f85269	feat: return streaming errors as an event formatted for openai's client (#2668) * feat: return streaming errors as an event formatted for openai's client * fix: propagate completions error events to stream * fix: improve stream api error format and add status code * fix: improve streamin error to include error_type * Revert "fix: improve streamin error to include error_type" This reverts commit `2b1a360b15`. * Reworked the implementation. * Revert "Reworked the implementation." This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5. * Small lifting. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-11-15 14:49:19 +01:00
Nicolas Patry	34a3bdedc3	Upgrading our deps. (#2750) * Upgrading our deps. * fixup. * Fixup.	2024-11-15 14:03:27 +01:00
Alex Weston	4580ced091	Upgrade outlines to 0.1.1 (#2742) * Upgrade outlines to 0.1.1 * Update for new API * Check if allowed tokens is None --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-11-15 13:22:52 +01:00
jito	003eaec0fb	fix response type of document for Text Generation Inference (#2743) Signed-off-by: jitokim <pigberger70@gmail.com>	2024-11-15 13:21:50 +01:00
Billel Mokeddem	4f4857a4ac	Fix: Change embeddings to embedding (#2738) fix: change embeddings to embedding Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>	2024-11-15 13:16:15 +01:00
Billel Mokeddem	f9ee46f740	Fix: Change model_type from ssm to mamba (#2740) Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>	2024-11-15 13:15:36 +01:00
Daniël de Kok	8442f1ac85	benchmark: fix prefill throughput (#2741)	2024-11-15 13:14:55 +01:00
Daniël de Kok	ca4f46ddfc	nix: update nixpkgs (#2746) Updates from Triton 2.1.0 to 3.1.0 (among other things).	2024-11-14 18:48:20 +01:00
Daniël de Kok	a785000842	Add initial support for compressed-tensors checkpoints (#2732) compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.	2024-11-10 13:54:07 +01:00
Wang, Yi	97f7a22f0b	add trust_remote_code in tokenizer to fix baichuan issue (#2725) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-07 14:43:38 +01:00
Wang, Yi	b1f9044d6c	fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… (#2717) fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ ipex kernel provide func like add_bias, so no need add it outside Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-04 16:07:51 +01:00
Daniël de Kok	5eedb2ec7a	nix: move to tgi-nix `main` (#2718)	2024-11-04 15:40:13 +01:00
Nicolas Patry	9fde566602	Fixing linting on main. (#2719)	2024-11-04 15:21:41 +01:00
Travis Addair	aadc9cb485	Fix prefix caching + speculative decoding (#2711)	2024-11-04 15:08:43 +01:00
Nicolas Patry	a5593ba83e	Hotfixing auto length (warmup max_s was wrong). (#2716)	2024-11-04 09:55:54 +01:00
drbh	08c4184eb2	fix: add chat_tokenize endpoint to api docs (#2710)	2024-11-04 06:44:59 +01:00
drbh	6e3220529d	fix: create position ids for text only input (#2714) * fix: create position ids for text only input * fix: prefer repeat over expand to avoid clone	2024-11-02 08:40:05 +08:00
drbh	01dacf8e8f	fix cuda graphs for qwen2-vl (#2708) * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl * fix: only check model type if config exists * fix: adjust sharding and lm head logic * fix qwen2 failure in intel cpu Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: return correct shape logits and add streaming test * fix: remove unused import and refactor test --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-11-01 03:05:34 +01:00
drbh	befd9f6735	Support qwen2 vl (#2689) * feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with meesage and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2	2024-10-30 12:40:51 -04:00
Wang, Yi	46aeb0860d	add xpu triton in dockerfile, or will show "Could not import Flash At… (#2702) add xpu triton in dockerfile, or will show "Could not import Flash Attention enabled models: No module named 'triton'" Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-10-30 14:18:50 +01:00
Nicolas Patry	98330df65e	Monkey patching as a desperate measure. (#2704) * Monkey patching as a desperate measure. * New snapshot ?	2024-10-28 11:25:13 +01:00
Nicolas Patry	513d19b955	More timeout on docker start ? (#2701) * More timeout on docker start ? * Latest upgrade.	2024-10-28 08:57:22 +01:00
Nicolas Patry	3a9cdc3241	Fixing auto bloom test. (#2699)	2024-10-28 06:14:11 +01:00
Nicolas Patry	78ce618c70	Update poetry lock. (#2698)	2024-10-28 06:11:33 +01:00
Nicolas Patry	90b226db29	We can have a tokenizer anywhere. (#2527) * We can have a tokenizer anywhere. * Handling potential lack of offsets (python tokenizer) * Remove redundancy. * Fixing the tests. * Flake.lock update ? * Fixing the GIL locking. * Fixing mamba by using the transformers version. * Adding the legacy handle. * Ellide lifetime. * Lint. * Deprecation message. * Fixing bad rebase.	2024-10-28 05:00:24 +01:00
Nicolas Patry	0c9b6cdd76	Choosing input/total tokens automatically based on available VRAM? (#2673) * Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).	2024-10-28 04:59:49 +01:00
Nicolas Patry	2e4f4ba1bb	Green main (#2697)	2024-10-28 04:59:32 +01:00
Nicolas Patry	8a8794a672	Avoiding timeout for bloom tests. (#2693) * Avoiding timeout for bloom tests. * Skip the test let's see if it's always the first tests that fails. * Fail early. * Pulling ? * No early exit.	2024-10-26 05:35:28 +02:00
OlivierDehaene	a6b02da971	chore: prepare 2.4.0 release (#2695)	2024-10-25 21:10:49 +00:00
OlivierDehaene	6f88bd9390	feat: add triton kernels to decrease latency of large batches (#2687) * feat: add triton kernels to decrease latency of large batches * cast to int32 * fix kernel * fix kernel * disable triton on rocm * fix speculation * add slots filtering kernel	2024-10-25 21:10:00 +00:00
Daniël de Kok	0f346a3296	Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688) * Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing. * Update test snapshots	2024-10-25 16:40:47 +02:00
Funtowicz Morgan	ba5fc7d922	Add support for stop words in TRTLLM (#2678) * feat(trtllm): rewrite health to not account for current state * chore(looper): cleanup a bit more * feat(post_processing): max_new_tokens is const evaluated now * chore(ffi):formatting * feat(trtllm): add stop words handling # Conflicts: # backends/trtllm/lib/backend.cpp * chore(trtllm): create specific parallelconfig factory and logging init methods * chore(trtllm): define a macro for SizeType cast * chore(trtllm): use GetParallelConfig * chore(trtllm): minor refactoring * chore(trtllm): validate there are enough GPus on the system for the desired model * chore(trtllm): ensure max throughput scheduling policy is selected * chore(trtllm): minor fix * chore(router): minor refactorings * feat(docker): build with-slurm ompi * feat(docker): add python3.10 dev to runtime deps * chore(docker): add mpi to ld_library_path * chore(docker): install transformers * feat(trtllm): detect stop_words from generation_config.json	2024-10-25 10:58:34 +02:00
Nicolas Patry	db68bd0524	Fixing mt0 test. (#2692)	2024-10-25 09:46:39 +02:00

1145 Commits, All Branches