hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
drbh	93a7042d7e	feat: support phi3.5 moe (#2479 ) * feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-30 11:15:09 +02:00
Daniël de Kok	5b6b74e21d	Improve support for GPUs with capability < 8 (#2575 ) * Improve support for GPUs with capability < 8 - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8. - Disable prefix caching when using paged attention. - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables. * nix: add flash-attn-v1 to the server environment * Move disabling prefix caching into the block of exceptions * Capability as `usize`s	2024-09-27 16:19:42 +02:00
Alvaro Bartolome	0aa66d693a	Fix build with `--features google` (#2566 ) * Fix `cargo build --features google` * Add `cargo test --features google`	2024-09-26 11:41:38 +02:00
Nicolas Patry	c032280b17	Cleanup Vertex + Chat (#2553 ) * Cleanup Vertex + Chat * logprobs defaults to false. * Parameters are optional * Fix docs. * Changing back this logprobs default. * Fixup doc. * Let's debug that. * Not unstable. * Updating Cargo ? * Wat? * Dummy change. * Trying some other install. * Trying smething. * Revert everything. * Update Cargo lock. * Fixing the pre-commit after rebase.	2024-09-24 23:37:17 +02:00
OlivierDehaene	10e6f29295	chore: Add old V2 backend (#2551 ) * wip * added v2	2024-09-24 08:38:17 +02:00
Nicolas Patry	f512021e77	Stream options. (#2533 ) * Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow	2024-09-19 20:50:37 +02:00
OlivierDehaene	86984e3236	fix: metrics unbounded memory (#2528 )	2024-09-17 16:01:28 +00:00
Nicolas Patry	dae3bf1d87	Fix tokenization yi (#2507 ) * Fixing odd tokenization self modifications on the Rust side (load and resave in Python). * Fixing the builds ? * Fix the gh action? * Fixing the location ? * Validation is odd. * Try a faster runner * Upgrade python version. * Remove sccache * No sccache. * Getting libpython maybe ? * List stuff. * Monkey it up. * have no idea at this point * Tmp. * Shot in the dark. * Tmate the hell out of this. * Desperation. * WTF. * -y. * Apparently 3.10 is not available anymore. * Updating the dockerfile to make libpython discoverable at runtime too. * Put back rust tests. * Why do we want mkl on AMD ? * Forcing 3.11 ?	2024-09-11 22:41:56 +02:00
Nicolas Patry	a4e3e8c608	Prefix test - Different kind of load test to trigger prefix test bugs. (#2490 ) * Adding prefix test. * [WIP] tmp dump of integration load tests. * Remove other tensor creation. * Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids. * Fix parsing * Is it really flashinfer version ? * Remove some comments. * Revert the max prefix hit. * Adding numpy to diff. * Upgraded flashinfer. * Upgrading some stuff. * Are we done yet ? * Minor fixup * Remove 1 log and put back the other. * Add comment for why slot 0 is OK. * Mounting on the job. * Get me a debug branch * Debugging CIs is fun. * Attempt #28 * wip * Tmate. * Praying. * Updating VLM causal model with updated context. * Important line got squashed. * Tmate again. * Fingers crossed. * We want only 1 run of integration tests..... --------- Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>	2024-09-11 18:10:40 +02:00
drbh	47d7e34458	fix: enable chat requests in vertex endpoint (#2481 ) * fix: enable chat requests in vertex endpoint * feat: avoid unwrap and pre allocate future vec	2024-09-02 10:00:52 -04:00
drbh	d5202c46f7	feat: add /v1/models endpoint (#2433 ) * feat: add /v1/models endpoint * feat: add /v1/models endpoint * fix: remove unused type import * fix: revert route typo * fix: update docs with new endpoint * fix: add to redocly ignore and lint	2024-08-29 16:32:38 +02:00
Nicolas Patry	e415b690a6	Lots of improvements (Still 2 allocators) (#2449 ) * Making prefix/flashinfer the default and testing the full release tests. * Include flashinfer in the docker. * Using prebuilt. * Allowing window_left_size (dummy version). * Disabling flashinfer/prefix caching on odd head_dim * Disable prefix caching for lora. * More specific codes. * Update lock * Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere. * Update cargo lock ? * Upgrade to 1.80 because of bitstream... * Everywhere 1.80 * Forgot last default place. * Apply suggestions from code review Co-authored-by: drbh <david.richard.holtz@gmail.com> * Updated flake lock * Tmp * Upgrade resolution system for less errors in resolution. * Remove lambda for cleaner function. * Handling debugger. * OVerride the env in server tests. * Is this enough to make it work ? * This seems to be working. * Downgrade some logs. * Fixing the default for vlm. * Don't enable prefix caching on VLM just yet. * Change `add_special_tokens` in order to have the correct tokens for chat input and not (since it's super important with the prefixing now) * Fixing prefix caching for flashdecoding. * Update all models. * Fixed flashinfer version. * add_special_tokens is internal only * Fixing seqlen with the new vlms. * Fixing the issue with `add_special_tokens` not being passed around. * Fixing the test. * Removing encoder_decoder (seq2seq). * Update the chat test. * Fixing the batching tokenization in flash causal lm. * Truncating left for radix purposes. * Oops this doesn't belong here. * Put back default pure shell. * Update server tests - Default to throughput test in k6 - Use TGI_WIGGLE_ROOM to adjust wiggle room * Only n_heads / process_group.size() are necessary. * Revert the integrationt tests change (seem linked to head_size modification). * Adding error message when assert is violated. * Fixing the free algorithm to handle times where the common prefix is smaller. * Apply suggestions from code review Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Update server/text_generation_server/layers/attention/common.py Co-authored-by: OlivierDehaene <olivier@huggingface.co> * Fix disabling prefix caching - Fix windowing checks. * Revert the Cohere tokenizer change (for now using a revision instead). * Fmt. --------- Co-authored-by: drbh <david.richard.holtz@gmail.com> Co-authored-by: OlivierDehaene <olivier@huggingface.co>	2024-08-29 16:29:01 +02:00
drbh	21187c27c9	fix: bump minijinja version and add test for llama 3.1 tools (#2463 ) * fix: support tojson and avoid message indexing issue in template * fix: prefer minijinja native methods and prefer workspace level dependency * fix: adjust comment typo	2024-08-27 13:31:08 -04:00
drbh	cfa73b5c99	Pr 2451 ci branch (#2454 ) * fix[router]: Fix tools not passed in chat template Signed-off-by: GitHub <noreply@github.com> * feat: improve default tool serialization and lints * feat: refactor tool logic to include notify_error in prompt and adjust typing * fix: adjust non tool template apply * fix: simplify tool grammar logic and improve schema * feat: avoid skip tool test and avoid empty tool prompts * fix: increase test client timeout for grammar compilation tests --------- Signed-off-by: GitHub <noreply@github.com> Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-08-26 20:19:38 -04:00
Hugo Larcher	53729b74ac	doc: Add metrics documentation and add a 'Reference' section (#2230 ) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-16 19:43:30 +02:00
drbh	9a7830bd28	Pr 2395 ci run (#2406 ) * fix(router): Fix appending to message content * feat: add message and chat template test --------- Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>	2024-08-12 14:38:59 -04:00
drbh	30395b09f4	fix: improve completions to send a final chunk with usage details (#2336 ) * fix: improve completions to send a final chunk with usage details * fix: include finish reason string * fix: remove dev debug trait and unneeded mut * fix: update openapi schema	2024-08-12 17:26:11 +02:00
drbh	155f9c98e2	feat: validate template variables before apply and improve sliding wi… (#2403 ) * feat: validate template variables before apply and improve sliding window check * fix: improve missing template var test	2024-08-12 10:58:40 -04:00
Nicolas Patry	136bcc8128	Keeping the benchmark somewhere (#2401 ) Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-08-12 15:22:02 +02:00
Daniël de Kok	8deeaca4ff	Add support for prefix caching to the v3 router (#2392 ) This change adds support for prefix caching to the v3 router. This is broken up from the backend support to ease reviewing. For now prefix caching is only enabled with `USE_PREFIX_CACHING=1` in this case, the router will switch to `RadixAllocator`. This allocator uses a radix trie to keep track of prefills that were seen prior. If a new prefill is a prefix of a previously-seen prefil, the router will send a request with `prefix_len>0`, which can be used by the backend to decide to reuse KV blocks from the cache, rather than recomputing them. Even though backend support is not added in this PR, the backend will still work with prefix caching enabled. The prefix lengths are just ignored and not used.	2024-08-12 14:59:17 +02:00
drbh	0d06aed02d	feat: add guideline to chat request and template (#2391 ) * feat: add guideline to chat request and template * fix: add template test and update docs	2024-08-09 10:56:45 -04:00
Nicolas Patry	7a48a84784	Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backens (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
drbh	6d06473cf4	Pr 2352 ci branch (#2382 ) * Fix unsigned integer underflow Passing --max-batch-size to the launcher actually had no effect because after a few requests the max_size passed to State::next_batch would underflow becoming a largo positive number. In the scheduler, as soon as the cached batch size reached the max_batch_size the max_size passed to next_batch becomes 0. Since the only check in that funcion is ``` if Some(batch_requests.len()) == max_size { break; } ``` and it's called after the `batch_requests.len()` has become 1, it doesn't do anything to prevent more than 0 requests from being batched. Now we have cached batch in the server that is large than max_batch_size and `max_size - batch_size as usize` underflows. Signed-off-by: Max de Bayser <mbayser@br.ibm.com> * fix: update v3 scheduler and ensure max_batch_size > 0 --------- Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Max de Bayser <mbayser@br.ibm.com>	2024-08-09 10:54:32 +02:00
drbh	21267f3ca3	add gptj modeling in TGI #2366 (CI RUN) (#2372 ) * add gptj modeling Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-07 21:32:37 -04:00
drbh	1768c00b9f	feat: return the generated text when parsing fails (#2353 )	2024-08-06 13:10:19 -04:00
drbh	f8a5b381fe	feat: prefer stop over eos_token to align with openai finish_reason (#2344 )	2024-08-06 13:09:50 -04:00
drbh	e11f5f1c38	feat: implement a templated endpoint for visibility into chat requests (#2333 ) * feat: implement a templated endpoint for visibility into chat requests * feat: improve to tokenize too * fix: adjust return type * feat: simplify prepare_chat_input logic and adjust start stop chars	2024-08-06 13:51:32 +02:00
Erik Kaunismäki	7451041ecd	refactor usage stats (#2339 ) * refactor usage stats * Update docs/source/usage_statistics.md Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * changes based on feedback * run python3 udpate_doc.py * fix pre-commit * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * delete option around usage stats arg --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-31 16:29:07 +02:00
Nicolas Patry	2b19d671b4	Rebase TRT-llm (#2331 ) * wip wip refacto refacto Initial setup for CXX binding to TRTLLM Working FFI call for TGI and TRTLLM backend Remove unused parameters annd force tokenizer name to be set Overall build TRTLLM and deps through CMake build system Enable end to end CMake build First version loading engines and making it ready for inference Remembering to check how we can detect support for chunked context Move to latest TensorRT-LLM version Specify which default log level to use depending on CMake build type make leader executor mode working unconditionally call InitializeBackend on the FFI layer bind to CUDA::nvml to retrieve compute capabilities at runtime updated logic and comment to detect cuda compute capabilities implement the Stream method to send new tokens through a callback use spdlog release 1.14.1 moving forward update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c correctly tell cmake to build dependent tensorrt-llm required libraries create cmake install target to put everything relevant in installation folder add auth_token CLI argument to provide hf hub authentification token allow converting huggingface::tokenizers error to TensorRtLlmBackendError use correct include for spdlog include guard to build example in cmakelists working setup of the ffi layer remove fmt import use external fmt lib end to end ffi flow working make sure to track include/ffi.h to trigger rebuild from cargo impl the rust backend which currently cannot move the actual computation in background thread expose shutdown function at ffi layer impl RwLock scenario for TensorRtLllmBackend oops missing c++ backend definitions compute the number of maximum new tokens for each request independently make sure the context is not dropped in the middle of the async decoding. remove unnecessary log add all the necessary plumbery to return the generated content update invalid doc in cpp file correctly forward back the log probabilities remove unneeded scope variable for now refactor Stream impl for Generation to factorise code expose the internal missing start/queue timestamp forward tgi parameters rep/freq penalty add some more validation about grammar not supported define a shared struct to hold the result of a decoding step expose information about potential error happening while decoding remove logging add logging in case of decoding error make sure executor_worker is provided add initial Dockerfile for TRTLLM backend add some more information in CMakeLists.txt to correctly install executorWorker add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper simplify prebuilt trtllm libraries name definition do the same name definition stuff for tensorrt_llm_executor_static leverage pkg-config to probe libraries paths and reuse new install structure from cmake fix bad copy/past missing nvinfer linkage direction align all the linker search dependency add missing pkgconfig folder for MPI in Dockerfile correctly setup linking search path for runtime layer fix missing / before tgi lib path adding missing ld_library_path for cuda stubs in Dockerfile update tgi entrypoint commenting out Python part for TensorRT installation refactored docker image move to TensorRT-LLM v0.11.0 make docker linter happy with same capitalization rule fix typo refactor the compute capabilities detection along with num gpus update TensorRT-LLM to latest version update TensorRT install script to latest update build.rs to link to cuda 12.5 add missing dependant libraries for linking clean up a bit install to decoder_attention target add some custom stuff for nccl linkage fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time use std::env::const::ARCH make sure variable live long enough... look for cuda 12.5 add some more basic info in README.md * Rebase. * Fix autodocs. * Let's try to enable trtllm backend. * Ignore backends/v3 by default. * Fixing client. * Fix makefile + autodocs. * Updating the schema thing + redocly. * Fix trtllm lint. * Adding pb files ? * Remove cargo fmt temporarily. * ? * Tmp. * Remove both check + clippy ? * Backporting telemetry. * Backporting `457fb0a1` * Remove PB from git. * Fixing PB with default member backends/client * update TensorRT-LLM to latest version * provided None for api_key * link against libtensorrt_llm and not libtensorrt-llm --------- Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com> Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>	2024-07-31 10:33:10 +02:00
drbh	f15e808d4c	fix: reject grammars without properties (#2309 )	2024-07-29 10:07:25 -04:00
Erik Kaunismäki	583d37a2f8	Run ci api key (#2315 ) * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints. * change name to info routes * Fix comment * convert strings to lowercase for case insensitive comparison * convert header to string * fixes and update docs * update docs again * revert wrong update --------- Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>	2024-07-29 11:14:17 +02:00
drbh	68a9685f1b	fix: adjust default tool choice (#2244 ) * fix: adjust default tool choice * feat: improve tool choice syntax and response parsing/errors * fix: remove dev tests * feat: add ToolChoice to docs	2024-07-19 11:12:02 -04:00
Erik Kaunismäki	4c19593a90	usage stats and crash reports (#2220 ) * draft of usage stats * fix wrong link * launcher doesn't need sysinfo dep * only tokenizer class instead of hole struct * unused import * fix clippy errors * update openAPI doc * cargo fmt * fix error in passing flags to router * try again to update docs * run pre-commit locally * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * on crash use anonymous error event * delete json_output and ngrok * more robust way of checking if is in container * more robust nvidia smi * parse xpu more robustly * fix errors * add nvidia-smi details in docs * cargo fmt * fix clippy * should make docs check pass * Update router/src/usage_stats.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * error reason can't be in nested json * cargo fmt --------- Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>	2024-07-19 16:17:56 +02:00
Erik Kaunismäki	457fb0a188	fix custom cache dir (#2226 ) * fix to not ignore HUGGINGFACE_HUB_CACHE in cache * delete printlns * delete newlines * maybe fix trailing whitespace	2024-07-15 15:17:13 +02:00
drbh	d789de329a	fix: append DONE message to chat stream (#2221 ) * fix: append DONE message to chat stream * fix: update completions endpoint	2024-07-11 10:42:58 -04:00
Nicolas Patry	4c976fb406	Updating the self check (#2209 ) * Updating the self check * Fix. * Revert the CLI . * cli. * Space. * Revert cargo update.	2024-07-09 17:23:48 +02:00
Nicolas Patry	fe710af25f	Adding sanity check to openapi docs.	2024-07-09 11:13:48 +02:00
drbh	87ebb6477b	feat: use model name as adapter id in chat endpoints (#2128 )	2024-07-08 16:06:49 +02:00
Wang, Yi	58effe78b5	update to metrics 0.23.0 or could work with metrics-exporter-promethe… (#2190 ) update to metrics 0.23.0 or could work with metrics-exporter-prometheus 0.15.1 Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-08 16:03:59 +02:00
Nicolas Patry	5ad41aa2a6	Fixing missing `object` field for regular completions. (#2175 ) * Fixing missing `object` field for regular completions. * Fixing docs by re-adding missing `Prompt`.	2024-07-03 12:56:27 +02:00
Nicolas Patry	be4a4c47f9	Revert "Fixing missing `object` field for regular completions." This reverts commit `2bbb7fa4b2`.	2024-07-03 10:41:39 +00:00
Nicolas Patry	2bbb7fa4b2	Fixing missing `object` field for regular completions.	2024-07-03 10:40:22 +00:00
drbh	571530dd9a	feat: improve update_docs for openapi schema (#2169 ) * feat: add pre commit step to force schema update when router changes * fix: prefer improved update_doc and start server and compare * fix: adjust typo * fix: adjust revert typo * fix: update workflow to use update_doc md command * feat: improve workflow to check openapi schema too * fix: adjust timeout for CI * fix: adjust raise condition and install server in ci * fix: install protoc before server * feat: improve update doc and add command to print router schema * fix: adjust autodoc workflow * fix: explicitly install protoc and python * fix: alllow trailing space in openapi schema diff	2024-07-03 09:53:35 +02:00
Nicolas Patry	4327210e6b	[Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940 ) * Using flash decoding Conditional flashdecoding. Fix max_q. Working kvcache Working version with flash decoding. Make it work for mistral. Fix after rebase.. Less intrusive. REvert changes in modeling. Speedup flashdecoding. HHachweew Hack to make other models work. Fixing non flash decoding llama path. Router logic knows about page size. Missing 2 models. Missing cohere. Fixing cohere flash decoding. Revamped all this architecture. Fix cohere. Fixing falcon. Enabling custom block size schedule. Update router/src/infer.rs Not sending preallocated output. * Making it work on non flash decoding. * Fix Cohere. * Fix non decoding paths. * Rebased. * No need for cache_manager anymore. * Update? * "ipex" -> "cpu" * These do not belong. * Factoring cu_seqlen_qk for better abstracting over every model. * Fixing non flash tests/imports. * Changing return everywhere. * Update mistral past. * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though). * Fixup mistral clamping (had issues with cuda graphs). * No need to recreate anything actually.	2024-07-01 23:28:00 +02:00
drbh	9eefb2f672	fix: prefer serde structs over custom functions (#2127 ) * fix: prefer enum for chat object * fix: adjust typo * fix: enum CompletionType not ObjectType * fix: adjust typo * feat: leverage serde for conditional deser * fix: adjust HubTokenizerConfig after rebase * fix: update create_post_processor logic for token type * fix: adjust unwrap syntax in template * Fixing the post processor. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-01 15:08:05 +02:00
Nicolas Patry	b4552f9de9	Fixing clippy. (#2149 )	2024-07-01 12:02:19 +02:00
Wang, Yi	6ea570ddfe	fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… (#2148 ) * fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_indices] Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-01 11:27:53 +02:00
drbh	74b0231b19	fix: refactor post_processor logic and add test (#2137 ) * fix: refactor post_processor logic and add test * fix: remove dev comment * fix: adjust when post_processor is overridden and improve create_post_processor	2024-06-27 23:16:19 +02:00
Nicolas Patry	0e4ab6d31c	Fixing malformed rust tokenizers (#2134 ) * Fixing malformed rust tokenizers * Fix for deepseek too.	2024-06-27 16:04:03 +02:00
Daniël de Kok	dd2d91b043	Idefics2: sync added image tokens with transformers (#2080 ) Before this change, the number of reserved image tokens was not the same as the number of images. Fixes #2029. While at it, also remove all the image token handling duplication in `prepare_input`.	2024-06-27 15:54:35 +02:00

1 2 3 4 5 ...

271 Commits