hf_text-generation-inference

Commit Graph

Author	SHA1	Message	Date
Nicholas Broad	7efcb5e0ed	remove LORA_ADAPTERS_PATH (#2563 ) specify how to call local adapters	2024-09-25 01:20:15 +02:00
Aritra Roy Gosthipaty	e6d29656b5	Adding note for private models in quick-tour document (#2548 ) * chore: adding note for private models in quicktour doc * Update docs/source/quicktour.md Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> * Update docs/source/quicktour.md Co-authored-by: vb <vaibhavs10@gmail.com> --------- Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> Co-authored-by: vb <vaibhavs10@gmail.com>	2024-09-24 15:06:53 +02:00
Nicolas Patry	169178b937	Preparing for release. (#2540 ) * Preparing for release. * Upgrade version in docs.	2024-09-20 17:42:04 +02:00
Daniël de Kok	abd24dd385	doc: clarify that `--quantize` is not needed for pre-quantized models (#2536 )	2024-09-19 22:17:15 +02:00
Martin Iglesias Goyanes	aaea212d0f	Add links to Adyen blogpost (#2500 ) * Add links to Adyen blogpost * Adding to toctree. * Update external.md * Update _toctree.yml --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-09-06 17:00:54 +02:00
Nicolas Patry	8b96a18265	Adding links to Adyen blogpost. (#2492 )	2024-09-05 16:11:52 +02:00
Wang, Yi	9883f3b40e	update doc with intel cpu part (#2420 ) * update doc with intel cpu part Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * Apply suggestions from code review we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release. --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-29 17:42:02 +02:00
drbh	8f99f165ce	fix: improve regex expression (#2468 )	2024-08-28 13:44:44 -04:00
Hugo Larcher	53729b74ac	doc: Add metrics documentation and add a 'Reference' section (#2230 ) * doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-08-16 19:43:30 +02:00
Nicolas Patry	cb0a29484d	FIxing the CI.	2024-08-16 14:21:29 +02:00
Vaibhav Srivastav	99b662f8c2	Improve the Consuming TGI + Streaming docs. (#2412 ) * Improve the Consuming TGI docs. * Fix erronous update to . * add info about Open AI client. * More updates. * Apply suggestions from code review Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> * Suggestions from Lucain. * Update Gradio snippet. * Up. * Apply suggestions from code review Co-authored-by: Lucain <lucainp@gmail.com> * Update docs/source/basic_tutorials/consuming_tgi.md Co-authored-by: Lucain <lucainp@gmail.com> * Up. * Apply suggestions from code review Co-authored-by: Omar Sanseviero <osanseviero@gmail.com> * Up. * Up. * Doc review from Nico. * Doc review from Nico. x2 * Last nit --------- Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>	2024-08-16 12:43:08 +02:00
Nicolas Patry	7a48a84784	Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385 ) * Using an enum for flash backens (paged/flashdecoding/flashinfer) * Early exit on server too. * Clippy. * Fix clippy and fmt.	2024-08-09 16:41:17 +02:00
Vaibhav Srivastav	b2b9c42724	Update documentation for Supported models (#2386 ) * Minor doc fixes * up. * Other minor updates.	2024-08-09 15:01:34 +02:00
Vaibhav Srivastav	cb3ae30284	Update Quantization docs and minor doc fix. (#2368 ) * Update Quantization docs and minor doc fix. * update readme with latest quants info * Apply suggestions from code review Co-authored-by: Pedro Cuenca <pedro@huggingface.co> * up --------- Co-authored-by: Pedro Cuenca <pedro@huggingface.co>	2024-08-08 16:06:57 -04:00
drbh	21267f3ca3	add gptj modeling in TGI #2366 (CI RUN) (#2372 ) * add gptj modeling Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> * fix: update docs for model addition * fix: adjust syntax typo * fix: adjust syntax typo again --------- Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>	2024-08-07 21:32:37 -04:00
drbh	dd47a3dac4	feat: include local lora adapter loading docs (#2359 )	2024-08-05 12:36:44 -04:00
Erik Kaunismäki	7451041ecd	refactor usage stats (#2339 ) * refactor usage stats * Update docs/source/usage_statistics.md Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * changes based on feedback * run python3 udpate_doc.py * fix pre-commit * Update router/src/server.rs Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com> * delete option around usage stats arg --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-07-31 16:29:07 +02:00
Erik Kaunismäki	583d37a2f8	Run ci api key (#2315 ) * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints. * change name to info routes * Fix comment * convert strings to lowercase for case insensitive comparison * convert header to string * fixes and update docs * update docs again * revert wrong update --------- Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>	2024-07-29 11:14:17 +02:00
Nicolas Patry	5d121a9705	Preparing for release. (#2285 ) * Preparing for release. * Updating docs. * Fixing token within the docker image for the launcher.	2024-07-23 16:20:17 +02:00
Daniël de Kok	e52be9bba2	Add support for Deepseek V2 (#2224 ) Deepseek V2 is a MoE model from Deepseek. Relevant variations compared to other models: - Grouped top-K in expert selection. - mscale in yarn is calculated using the `mscale` and `mscale_all_dim` configuration options. - `mscale_all_dim` is also used in scaling attention softmax. - Permuting of the query/key representations before applying rotary embeddings. - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`). So, we need weight loads that supports quantized weights. To this end `{Weights,WeightLoader}.get_weight` was added. - The query/key head dimensionality differs from that of the value, so we need to pad during attention. - Heads with size 192, needs an extension to our paged attention fork and we need to ensure that the KV cache is allocated with the correct size. - Shared experts.	2024-07-19 17:23:20 +02:00
Erik Kaunismäki	40f5dc3ed6	add usage stats to toctree (#2260 ) quick fix	2024-07-19 16:34:04 +02:00
Erik Kaunismäki	4c19593a90	usage stats and crash reports (#2220 ) * draft of usage stats * fix wrong link * launcher doesn't need sysinfo dep * only tokenizer class instead of hole struct * unused import * fix clippy errors * update openAPI doc * cargo fmt * fix error in passing flags to router * try again to update docs * run pre-commit locally * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * Update router/src/main.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * on crash use anonymous error event * delete json_output and ngrok * more robust way of checking if is in container * more robust nvidia smi * parse xpu more robustly * fix errors * add nvidia-smi details in docs * cargo fmt * fix clippy * should make docs check pass * Update router/src/usage_stats.rs Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> * error reason can't be in nested json * cargo fmt --------- Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co> Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>	2024-07-19 16:17:56 +02:00
Nicolas Patry	4c976fb406	Updating the self check (#2209 ) * Updating the self check * Fix. * Revert the CLI . * cli. * Space. * Revert cargo update.	2024-07-09 17:23:48 +02:00
Nicolas Patry	fe710af25f	Adding sanity check to openapi docs.	2024-07-09 11:13:48 +02:00
Wang, Yi	07e240ca37	add doc for intel gpus (#2181 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>	2024-07-08 15:57:06 +02:00
Nicolas Patry	fb2f74e2b9	Refactor dead code - Removing all `flash_xxx.py` files. (#2166 ) * Refactor dead code. * First working step. * Remove a lot of duplicated code. * More dead code. * More cleanup. * Fix Santacoder test. * Fixing the simple tests. * Fixing sharding. * Fixes for VLM. * Fixing santacoder (num_kv_heads hardcoded). * Removing more dead code. * Fixing `config.n_head`. * Stopping earlier because of `<end_of_utterance>` in idefics2. * Addresses comments. * Removing the dead code. * Fuse back mistral into FlashCausalLM. * Finish removal. * Fixing docs + causal_lm `batch_class`. * Fixing docs + causal.lm. * Add default to Gemma Causality. * Default value for gemma/gemma2. * Wrong default.	2024-07-05 10:29:56 +02:00
Nicolas Patry	245d3de948	Preparing patch release. (#2186 )	2024-07-04 10:55:33 +02:00
Nicolas Patry	3ea8259af1	Fixing gemma2. (#2135 ) * Fixing gemma2. * Adding new model.	2024-06-27 16:04:20 +02:00
Nicolas Patry	b53b21c63a	Bumping to 2.1 (#2131 )	2024-06-27 12:34:43 +02:00
drbh	04e1af94d7	Enable multiple LoRa adapters (#2010 ) * feat: first draft load multiple lora * feat: load weights within layer and refactor lora pass * fix: refactor and reduce lora math * feat: baseline impl single request multi lora support * feat: prefer lorax implementation and port loading logic * fix: prefer adapter_data and refactors * feat: perfer loraxs custom punica kernels and add mlp loras * fix: adjust batch for bgmv * fix: adjust adapter_segments logic when in batch * fix: refactor and move changes to v3 proto * fix: pass model_id for all flash causal lms * fix: pass model_id for all causal and seq2seq lms * fix: add model_id to model test * feat: add lora support to mistral and refactors * feat: prefer model id in request * fix: include rust code for adapter id * feat: bump launcher and add new lora docs * feat: support base model generation and refactors * fix: rename doc to retry ci build * feat: support if vlm models * fix: add adapter_data param and avoid missing layers * fix: add adapter_data param to phi and neox * fix: update all models forwards to include adapter_data * fix: add model_id to IdeficsCausalLM * Update lora.md Fixed a typo * Update lora.md Fixing spam image * fix: add lora kernel to dockerfile, support running without kernels and refactors * fix: avoid dockerfile conflict * fix: refactors and adjust flash llama lora logic * fix: skip llama test due to CI issue (temp) * fix: skip llama test CI (temp) 2 * fix: revert skips and prefer updated ci token for tests * fix: refactors and helpful comments * fix: add noop in TensorParallelAdapterRowLinear too * fix: refactor and move shard_lora_weights logic * fix: exit early if no adapter_data --------- Co-authored-by: Derek <datavistics@gmail.com>	2024-06-25 14:46:27 -04:00
KevinDuffy94	1869ee2f57	Add OTLP Service Name Environment Variable (#2076 ) * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069 * Update Docs * Update README.md * Update Launcher Docs * Update Launcher Docs Removing Option	2024-06-25 09:33:01 +02:00
Lucain	3447c722fd	Support `HF_TOKEN` environment variable (#2066 ) * Support HF_TOKEN environement variable * Load test. --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-06-25 09:23:12 +02:00
Alvaro Moran	445f313504	Adding architecture document (#2044 ) * doc: adding architecture document * doc: add architecture to toctree * fix: avoid cargo lock changes * fix: avoid cargo lock tweak --------- Co-authored-by: drbh <david.richard.holtz@gmail.com>	2024-06-14 09:28:34 -04:00
Tiezhen WANG	96b7b40ca3	Update the link for qwen2 (#2068 ) * Update the link for qwen2 * Fix Qwen2 model URL in model table * Fix too eager staging --------- Co-authored-by: Daniël de Kok <me@danieldk.eu>	2024-06-14 11:59:33 +02:00
Daniël de Kok	4594e6faba	Add support for Marlin-quantized models This change adds support for Marlin-quantized models. Marlin is an FP16xINT4 matmul kernel, which provides good speedups decoding batches of 16-32 tokens. It supports quantized models with symmetric quantization, groupsize -1 or 128, and 4-bit. Tested with: - Llama 2 - Llama 3 - Phi 3	2024-06-06 13:16:52 +02:00
Emmanuel Ferdman	fec0167a12	fix: update triton implementation reference (#2002 ) # What does this PR do? <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> PR #1986 moved the location of the `flash_attn_triton.py` file. This PR adjusts sources to changes. ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil --> Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>	2024-06-04 14:26:35 +02:00
Nicholas Broad	08b3eac2ce	single char ` addition for docs (#1989 ) # What does this PR do? I think this will fix the docs from being weirdly formatted. All the sections after MAX_TOP_N_TOKENS don't show up in the bar on the right (https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtopntokens) ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? @merveenoyan --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-05-31 18:42:14 +02:00
fxmarty	659bd67fec	Update documentation version to 2.0.4 (#1980 ) As per title cc @Narsil	2024-05-31 16:03:24 +02:00
Daniël de Kok	36dd16017c	Add support for exl2 quantization Mostly straightforward, changes to existing code: * Wrap quantizer parameters in a small wrapper to avoid passing around untyped tuples and needing to repack them as a dict. * Move scratch space computation to warmup, because we need the maximum input sequence length to avoid allocating huge scratch buffers that OOM.	2024-05-30 11:28:05 +02:00
Nicolas Patry	e76b9824ae	Upgrade to Axum 0.7 and Hyper 1.0 (Breaking change: disabled ngrok tunneling). (#1959 ) - Axum upgraded to hyper 1.0 and most of the ecosystem switched so it's our time now - [ngrok-rust](https://github.com/ngrok/ngrok-rust/pull/137/files) hasn't yet, and hasn't for several months now, so let's disabled the feature for the time being. # What does this PR do? <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil -->	2024-05-28 14:52:17 +02:00
Moritz Laurer	b7ffa287f2	fix small typo and broken link (#1958 ) # What does this PR do? Fix a typo; fix a broken link; add one sentence in the guidance docs to make the word "grammar" less abstract ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @drbh	2024-05-27 11:31:06 -04:00
drbh	a103e3e9e2	feat: add train medusa head tutorial (#1934 ) This PR adds a tutorial to self distill and train medusa heads for a specific model --------- Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>	2024-05-23 11:34:18 +02:00
Nicolas Patry	2f243a1a15	Creating doc automatically for supported models. (#1929 ) # What does this PR do? <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil -->	2024-05-22 16:22:57 +02:00
Junlin Zhou	904ff36917	docs: Fix grafana dashboard url (#1925 ) # What does this PR do? <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> <!-- Remove if not applicable --> Fixes an incorrect url in monitoring doc. ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil -->	2024-05-21 13:12:14 -04:00
fxmarty	293b8125e7	ROCm: make CK FA2 default instead of Triton (#1924 ) As per title. Triton autotune overhead is prohibitive, as it needs to be done for each different prompt length.	2024-05-20 08:44:48 +08:00
fxmarty	b5f1c9de06	Fix TunableOp bug (#1920 ) cc @Narsil	2024-05-17 18:21:51 +02:00
fxmarty	c4cf8b49d1	Add TGI monitoring guide through Grafana and Prometheus (#1908 ) As per title. It is very useful.	2024-05-17 10:34:44 -04:00
fxmarty	232e8d5227	MI300 compatibility (#1764 ) Adds support for AMD Instinct MI300 in TGI. Most changes are: * Support PyTorch TunableOp to pick the GEMM/GEMV kernels for decoding https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable. TunableOp is disabled by default, and can be enabled with `PYTORCH_TUNABLEOP_ENABLED=1`. * Update ROCm dockerfile to PyTorch 2.3 (actually patched with changes from https://github.com/pytorch/pytorch/pull/124362) * Support SILU & Linear custom kernels contributed by AMD * Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/, branching out of a much more recent commit `3489ce7936` * Support FA2 Triton kernel as recommended by AMD. Can be used by specifying `ROCM_USE_FLASH_ATTN_V2_TRITON=1`. * Update dockerfile to ROCm 6.1 By default, TunableOp tuning results are saved in `/data` (e.g. `/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) in order to avoid to have to rerun the tuning at each `docker run`. Example: ``` Validator,PT_VERSION,2.3.0 Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c Validator,HIPBLASLT_VERSION,0.7.0-1549b021 Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack- Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098 GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431 GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546 GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119 GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645 GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971 GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694 GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522 GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671 GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834 GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622 GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122 GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191 GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514 GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914 GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516 GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953 GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043 GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497 GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895 GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716 GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731 GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816 GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701 GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159 GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524 GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074 GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045 GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582 GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705 GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489 ``` --------- Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>	2024-05-17 15:30:47 +02:00
Daniël de Kok	b5bc6e5c4e	Add GPT-2 with flash attention (#1889 ) # What does this PR do? <!-- Congratulations! You've made it this far! You're not quite done yet though. Once merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution. Then, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change. Once you're done, someone will review your PR shortly (see the section "Who can review?" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost. --> This change adds `FlashGPT2ForCausalLM` and wires it up. The model itself is pretty straightforward, the main difference from other models is that it uses trained position embeddings and that all weight matrices are transposed compared to other models (due to the use of Conv1D in the upstream model). <!-- Remove if not applicable --> Fixes # (issue) ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [x] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [x] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. @Narsil <!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @ @OlivierDehaene OR @Narsil -->	2024-05-15 13:31:22 +02:00
Brandon Lockaby	92f1338b84	Correct 'using guidance' link (#1892 ) Fix typo in link to 'using guidance' article	2024-05-14 14:23:39 -04:00

1 2 3

130 Commits