hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Nicolas Patry	b4654a36dc	Fixing up the tests ?	2024-09-16 17:01:51 +02:00
Nicolas Patry	5adece6313	This doesn't seem needed.	2024-09-16 17:01:51 +02:00
Nicolas Patry	dd4b774e0d	New cargo lock	2024-09-16 17:01:44 +02:00
Nicolas Patry	b7cb8d5145	Let's figure out the issue...	2024-09-16 17:01:30 +02:00
Nicolas Patry	3d7b81535a	Only link cuda driver libraries.	2024-09-16 17:01:30 +02:00
Nicolas Patry	e898483db6	Updating outlines to 0.0.46	2024-09-16 17:01:30 +02:00
Nicolas Patry	ce3efc83ed	Remove tmate.	2024-09-16 17:01:30 +02:00
Nicolas Patry	7f58f7dc61	Symlink all the things.	2024-09-16 17:01:29 +02:00
Nicolas Patry	42107de71f	Let's try to find libnvidia-ml	2024-09-16 17:01:29 +02:00
Nicolas Patry	edaa7f847d	Does this work ?	2024-09-16 17:01:29 +02:00
Nicolas Patry	d1e79ddae0	Fix override.	2024-09-16 17:01:29 +02:00
Nicolas Patry	db054b95df	Check the paths.	2024-09-16 17:01:29 +02:00
Nicolas Patry	afcd047a58	Yaml yaml.	2024-09-16 17:01:29 +02:00
Nicolas Patry	60db294f9a	Link cuda to nix ?	2024-09-16 17:01:28 +02:00
Nicolas Patry	8e7c7c61f1	Let's see what the issue is ?	2024-09-16 17:01:28 +02:00
Nicolas Patry	815449da74	Removing unused code.	2024-09-16 17:01:28 +02:00
Nicolas Patry	c227345878	Run on actual GPUs.	2024-09-16 17:01:28 +02:00
Nicolas Patry	3d73c99ebe	Attempt at integration tests.	2024-09-16 17:01:28 +02:00
Nicolas Patry	f47cdc1fe1	Attempting rapidly the integration tests.	2024-09-16 17:01:26 +02:00
Nicolas Patry	38fcafcf96	Adding a test for FD. (#2516 ) * Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.	2024-09-16 17:00:54 +02:00
Daniël de Kok	7774655297	Add tests for Mixtral (#2520 ) Disable by default because CI runners do not have enough GPUs.	2024-09-16 12:39:18 +02:00
Alex Strick van Linschoten	9cca3e0b03	Use `ratatui` not (deprecated) `tui` (#2521 ) * use ratatui not archived tui * bump ratatui all the way with options	2024-09-13 18:45:28 +02:00
Wang, Yi	3ac7df2b6d	hotfix : enable intel ipex cpu and xpu in python3.11 (#2517 ) enable intel ipex cpu and xpu in python3.11 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-12 17:23:49 +02:00
drbh	628334d336	fix: pass missing revision arg for lora adapter when loading multiple… (#2510 ) fix: pass missing revision arg for lora adapter when loading multiple adapters	2024-09-12 17:04:52 +02:00
Nicolas Patry	d95c670ada	Add nix test. (#2513 ) * Add nix test. * Modifying yourself means you need to rerun. * Fixing the test + adding click (needed for pre-commit hooks). * Try this. * Our runner + pure test (not written) * Remove server. * Root user. * Different user ? * Add the actual test target. * Forgot this modification. * Add a formatter. * Add the secrets. * Fixed the auth token ? * Adding the other tests. * Missing pre-commit. * Test requires cargo for cargo fmt. * Update it a bit. * Up. * Attempting to use a cache location for the models. * Ignore the cache for now.	2024-09-12 14:54:56 +02:00
Daniël de Kok	94304649f1	nix: support Python tokenizer conversion in the router (#2515 ) Ideally we wouldn't have the router wrapper that this change adds, but when I give PyO3 a Python interpreter with packages, it ends up linking libpython from the Python interpreter rather than the constructed environment and cannot pick up the Python modules as a result.	2024-09-12 10:44:01 +02:00
Nicolas Patry	69e3be20fb	Fix truffle (#2514 ) * Attempting to discard the trufflehog warning. * Attempt to fix trufflehog.	2024-09-11 22:45:19 +02:00
Nicolas Patry	dae3bf1d87	Fix tokenization yi (#2507 ) * Fixing odd tokenization self modifications on the Rust side (load and resave in Python). * Fixing the builds ? * Fix the gh action? * Fixing the location ? * Validation is odd. * Try a faster runner * Upgrade python version. * Remove sccache * No sccache. * Getting libpython maybe ? * List stuff. * Monkey it up. * have no idea at this point * Tmp. * Shot in the dark. * Tmate the hell out of this. * Desperation. * WTF. * -y. * Apparently 3.10 is not available anymore. * Updating the dockerfile to make libpython discoverable at runtime too. * Put back rust tests. * Why do we want mkl on AMD ? * Forcing 3.11 ?	2024-09-11 22:41:56 +02:00
Nicolas Patry	a4e3e8c608	Prefix test - Different kind of load test to trigger prefix test bugs. (#2490 ) * Adding prefix test. * [WIP] tmp dump of integration load tests. * Remove other tensor creation. * Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids. * Fix parsing * Is it really flashinfer version ? * Remove some comments. * Revert the max prefix hit. * Adding numpy to diff. * Upgraded flashinfer. * Upgrading some stuff. * Are we done yet ? * Minor fixup * Remove 1 log and put back the other. * Add comment for why slot 0 is OK. * Mounting on the job. * Get me a debug branch * Debugging CIs is fun. * Attempt #28 * wip * Tmate. * Praying. * Updating VLM causal model with updated context. * Important line got squashed. * Tmate again. * Fingers crossed. * We want only 1 run of integration tests..... --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>	2024-09-11 18:10:40 +02:00
Vallepu Vamsi Krishna	eabbbbda23	Add Directory Check to Prevent Redundant Cloning in Build Process (#2486 ) Update Makefile-fbgemm Added Directory check for FBGEMM repository cloning.	2024-09-07 13:19:43 +02:00
Nicolas Patry	c1fe28d694	Fixing more correctly the invalid drop of the batch. (#2498 )	2024-09-06 17:35:49 +02:00
Martin Iglesias Goyanes	aaea212d0f	Add links to Adyen blogpost (#2500 ) * Add links to Adyen blogpost * Adding to toctree. * Update external.md * Update _toctree.yml --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-06 17:00:54 +02:00
Daniël de Kok	a3c9c62dc0	hotfix: add syrupy to the right subproject (#2499 )	2024-09-06 12:47:06 +02:00
Daniël de Kok	379472c4c2	radix trie: add assertions (#2491 ) These should all be cheap assertions. Also: * Fixup some comments. * Delete a `remove` that was done unnecessarily twice.	2024-09-06 11:55:23 +02:00
Daniël de Kok	2eb57a15ec	Fix incompatibility with latest `syrupy` and update in Poetry (#2497 )	2024-09-06 11:00:52 +02:00
Daniël de Kok	0424e27f65	nix: add pyright/ruff for proper LSP in the impure devshell (#2496 ) We need this to ensure that pyright/ruff are part of the same interpreter/venv.	2024-09-06 10:19:04 +02:00
Wang, Yi	5cd8025f18	hotfix: fix regression of attention api change in intel platform (#2439 ) fix regression caused by attention api change. ipex.varlen_attention does not support paged-cache format kv input now. Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-09-05 17:41:39 +02:00
Daniël de Kok	e279b38aca	Add two handy gitignores for Nix environments (#2484 )	2024-09-05 17:06:54 +02:00
Nicolas Patry	8b96a18265	Adding links to Adyen blogpost. (#2492 )	2024-09-05 16:11:52 +02:00
Daniël de Kok	deec30f893	hotfix: avoid non-prefilled block use when using prefix caching (#2489 ) The minimum batch size logic could cause prefix blocks to be deallocated without prefill. The next allocation of the same prefix would then use garbage blocks.	2024-09-05 15:09:29 +02:00
drbh	6cb42f49ae	feat: support lora revisions and qkv_proj weights (#2482 ) * feat: support lora revisions and qkv_proj weights * fix: add qkv_proj weights to weight test	2024-09-02 13:09:06 -04:00
drbh	47d7e34458	fix: enable chat requests in vertex endpoint (#2481 ) * fix: enable chat requests in vertex endpoint * feat: avoid unwrap and pre allocate future vec	2024-09-02 10:00:52 -04:00
Daniël de Kok	de2cdeca53	nix: add punica-kernels (#2477 ) Enables LoRA support.	2024-09-02 11:31:36 +02:00
Daniël de Kok	e4ab855480	nix: improve impure devshell (#2478 ) - Add some test dependencies. - Install server in venv. - Install Python client in venv.	2024-09-02 09:27:10 +02:00
Nicolas Patry	d9fbbaafb0	Tied embeddings in MLP speculator. (#2473 ) * Tied embeddings in MLP speculator. * Fixing the scale_weight when users decide to not use the speculation as much as defined in the config. * Adding scaling support + optimize some ops.	2024-08-29 17:44:54 +02:00
Wang, Yi	9883f3b40e	update doc with intel cpu part (#2420 ) * update doc with intel cpu part Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release. --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-29 17:42:02 +02:00
drbh	d5202c46f7	feat: add /v1/models endpoint (#2433 ) * feat: add /v1/models endpoint * feat: add /v1/models endpoint * fix: remove unused type import * fix: revert route typo * fix: update docs with new endpoint * fix: add to redocly ignore and lint	2024-08-29 16:32:38 +02:00
Nicolas Patry	e415b690a6	Lots of improvements (Still 2 allocators) (#2449 ) * Making prefix/flashinfer the default and testing the full release tests. * Include flashinfer in the docker. * Using prebuilt. * Allowing window_left_size (dummy version). * Disabling flashinfer/prefix caching on odd head_dim * Disable prefix caching for lora. * More specific codes. * Update lock * Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere. * Update cargo lock ? * Upgrade to 1.80 because of bitstream... * Everywhere 1.80 * Forgot last default place. * Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com> * Updated flake lock * Tmp * Upgrade resolution system for less errors in resolution. * Remove lambda for cleaner function. * Handling debugger. * Override the env in server tests. * Is this enough to make it work ? * This seems to be working. * Downgrade some logs. * Fixing the default for vlm. * Don't enable prefix caching on VLM just yet. * Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) * Fixing prefix caching for flashdecoding. * Update all models. * Fixed flashinfer version. * add_special_tokens is internal only * Fixing seqlen with the new vlms. * Fixing the issue with `add_special_tokens` not being passed around. * Fixing the test. * Removing encoder_decoder (seq2seq). * Update the chat test. * Fixing the batching tokenization in flash causal lm. * Truncating left for radix purposes. * Oops this doesn't belong here. * Put back default pure shell. * Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room * Only n_heads / process_group.size() are necessary. * Revert the integration tests change (seem linked to head_size modification). * Adding error message when assert is violated. * Fixing the free algorithm to handle times where the common prefix is smaller. * Apply suggestions from code review Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Update server/text_generation_server/layers/attention/common.py Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Fix disabling prefix caching - Fix windowing checks. * Revert the Cohere tokenizer change (for now using a revision instead). * Fmt. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-08-29 16:29:01 +02:00
Daniël de Kok	4e821c003a	nix: build Torch against MKL and various other improvements (#2469 ) Updates tgi-nix input: - Move Torch closer to upstream by building against MKL. - Remove compute capability 8.7 from Torch (Jetson). - Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA). - Use nixpkgs configuration passed through by `tgi-nix`.	2024-08-29 16:25:25 +02:00
drbh	8f99f165ce	fix: improve regex expression (#2468 )	2024-08-28 13:44:44 -04:00


1079 Commits