Commit Graph

141 Commits

Author SHA1 Message Date
Alvaro Bartolome 74489227e0
Add Google Cloud in `docs/source/references/api_reference.md` 2024-10-05 16:54:17 +02:00
Alvaro Bartolome 47c01cb048
Fix AWS Sagemaker indentation, typo and header level 2024-10-05 13:26:25 +02:00
Daniël de Kok 2358c2bb54
Add basic FP8 KV cache support (#2603)
* Add basic FP8 KV cache support

This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:

* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.

* Fix Cargo.toml
2024-10-04 17:51:48 +02:00
drbh 3011639ff7
Revert "Unroll notify error into generate response" (#2605)
Revert "Unroll notify error into generate response (#2597)"

This reverts commit d22b0c1fbe.
2024-10-03 17:56:40 -04:00
Nicolas Patry f6e2f05b16
New release 2.3.1 (#2604)
* New release 2.3.1

* Update doc number
2024-10-03 14:43:49 +02:00
drbh d22b0c1fbe
Unroll notify error into generate response (#2597)
* feat: unroll notify_error if no tool is choosen

* fix: expect simple message when no tool is selected

* fix: improve test to avoid notify_error

* fix: improve docs and indicate change in expected response

* fix: adjust linting in test file
2024-10-02 11:34:57 -04:00
drbh 2335459556
CI (2592): Allow LoRA adapter revision in server launcher (#2602)
allow revision for lora adapters from launcher

Co-authored-by: Sida <sida@kulamind.com>
Co-authored-by: teamclouday <teamclouday@gmail.com>
2024-10-02 10:51:04 -04:00
Nicolas Patry d18ed5cfc5
Mllama flash version (#2585)
* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Ugrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integrations tests for mllama (cutting to 10 tokens because there seems'
to be instability after (meaning size of the batch matters.

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0
2024-10-02 11:22:13 +02:00
drbh 93a7042d7e
feat: support phi3.5 moe (#2479)
* feat: support phi3.5 moe model loading

* fix: prefer llama base model and improve rotary logic

* feat: return reasonable generation and add integration test

* fix: run lint and update docs

* fix: rerun lint for openapi docs

* fix: prefer do_sample false unless temp is set by user, and update chat tests

* fix: small typo adjustments

* fix: consolidate long rope paths

* fix: revert greedy by default and test changes

* Vendor configuration so that we don't have to `trust_remote_code`

* Use SparseMoELayer

* Add support for dense MoE

* Some type annotations

* Add the usual model tests

* Ruff.

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-30 11:15:09 +02:00
Mohit Sharma f9e561eced
Update ROCM libs and improvements (#2579)
* style

* update torch

* ix issues

* fix clone

* revert mkl

* added custom PA

* style

* fix style

* style

* hide env vart

* fix mixtral model

* add skinny kernel and merge fixes

* fixed style

* fix issue for sliding window models

* addressed review comments

* fix import

* improved error messag

* updated default value

* remove import

* fix imports after rebase

* float16 dep

* improve dockerfile

* cleaned dockerfile
2024-09-30 10:54:32 +02:00
Ikram Ul Haq e790cfc0e4
Update architecture.md (#2577) 2024-09-30 08:56:20 +02:00
Nicholas Broad 7efcb5e0ed
remove LORA_ADAPTERS_PATH (#2563)
specify how to call local adapters
2024-09-25 01:20:15 +02:00
Aritra Roy Gosthipaty e6d29656b5
Adding note for private models in quick-tour document (#2548)
* chore: adding note for private models in quicktour doc

* Update docs/source/quicktour.md

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

* Update docs/source/quicktour.md

Co-authored-by: vb <vaibhavs10@gmail.com>

---------

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
2024-09-24 15:06:53 +02:00
Nicolas Patry 169178b937
Preparing for release. (#2540)
* Preparing for release.

* Upgrade version in docs.
2024-09-20 17:42:04 +02:00
Daniël de Kok abd24dd385
doc: clarify that `--quantize` is not needed for pre-quantized models (#2536) 2024-09-19 22:17:15 +02:00
Martin Iglesias Goyanes aaea212d0f
Add links to Adyen blogpost (#2500)
* Add links to Adyen blogpost

* Adding to toctree.

* Update external.md

* Update _toctree.yml

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-09-06 17:00:54 +02:00
Nicolas Patry 8b96a18265
Adding links to Adyen blogpost. (#2492) 2024-09-05 16:11:52 +02:00
Wang, Yi 9883f3b40e
update doc with intel cpu part (#2420)
* update doc with intel cpu part

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* Apply suggestions from code review

we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release.

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-29 17:42:02 +02:00
drbh 8f99f165ce
fix: improve regex expression (#2468) 2024-08-28 13:44:44 -04:00
Hugo Larcher 53729b74ac
doc: Add metrics documentation and add a 'Reference' section (#2230)
* doc: Add metrics documentation and add a 'Reference' section

* doc: Add API reference

* doc: Refactor API reference

* fix: Message API link

* Bad rebase

* Moving the docs.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-08-16 19:43:30 +02:00
Nicolas Patry cb0a29484d
FIxing the CI. 2024-08-16 14:21:29 +02:00
Vaibhav Srivastav 99b662f8c2
Improve the Consuming TGI + Streaming docs. (#2412)
* Improve the Consuming TGI docs.

* Fix erronous update to .

* add info about Open AI client.

* More updates.

* Apply suggestions from code review

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>

* Suggestions from Lucain.

* Update Gradio snippet.

* Up.

* Apply suggestions from code review

Co-authored-by: Lucain <lucainp@gmail.com>

* Update docs/source/basic_tutorials/consuming_tgi.md

Co-authored-by: Lucain <lucainp@gmail.com>

* Up.

* Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>

* Up.

* Up.

* Doc review from Nico.

* Doc review from Nico. x2

* Last nit

---------

Co-authored-by: Erik Kaunismäki <erik.kaum@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
2024-08-16 12:43:08 +02:00
Nicolas Patry 7a48a84784
Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385)
* Using an enum for flash backens (paged/flashdecoding/flashinfer)

* Early exit on server too.

* Clippy.

* Fix clippy and fmt.
2024-08-09 16:41:17 +02:00
Vaibhav Srivastav b2b9c42724
Update documentation for Supported models (#2386)
* Minor doc fixes

* up.

* Other minor updates.
2024-08-09 15:01:34 +02:00
Vaibhav Srivastav cb3ae30284
Update Quantization docs and minor doc fix. (#2368)
* Update Quantization docs and minor doc fix.

* update readme with latest quants info

* Apply suggestions from code review

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>

* up

---------

Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2024-08-08 16:06:57 -04:00
drbh 21267f3ca3
add gptj modeling in TGI #2366 (CI RUN) (#2372)
* add gptj modeling

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix: update docs for model addition

* fix: adjust syntax typo

* fix: adjust syntax typo again

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
2024-08-07 21:32:37 -04:00
drbh dd47a3dac4
feat: include local lora adapter loading docs (#2359) 2024-08-05 12:36:44 -04:00
Erik Kaunismäki 7451041ecd
refactor usage stats (#2339)
* refactor usage stats

* Update docs/source/usage_statistics.md

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* Update router/src/server.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* changes based on feedback

* run python3 udpate_doc.py

* fix pre-commit

* Update router/src/server.rs

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* delete option around usage stats arg

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-07-31 16:29:07 +02:00
Erik Kaunismäki 583d37a2f8
Run ci api key (#2315)
* Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.

* change name to info routes

* Fix comment

* convert strings to lowercase for case insensitive comparison

* convert header to string

* fixes and update docs

* update docs again

* revert wrong update

---------

Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
2024-07-29 11:14:17 +02:00
Nicolas Patry 5d121a9705
Preparing for release. (#2285)
* Preparing for release.

* Updating docs.

* Fixing token within the docker image for the launcher.
2024-07-23 16:20:17 +02:00
Daniël de Kok e52be9bba2
Add support for Deepseek V2 (#2224)
Deepseek V2 is a MoE model from Deepseek. Relevant variations
compared to other models:

- Grouped top-K in expert selection.
- mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
  configuration options.
- `mscale_all_dim` is also used in scaling attention softmax.
- Permuting of the query/key representations before applying rotary
  embeddings.
- Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
  So, we need weight loads that supports quantized weights. To this
  end `{Weights,WeightLoader}.get_weight` was added.
- The query/key head dimensionality differs from that of the value,
  so we need to pad during attention.
- Heads with size 192, needs an extension to our paged attention
  fork and we need to ensure that the KV cache is allocated with the
  correct size.
- Shared experts.
2024-07-19 17:23:20 +02:00
Erik Kaunismäki 40f5dc3ed6
add usage stats to toctree (#2260)
quick fix
2024-07-19 16:34:04 +02:00
Erik Kaunismäki 4c19593a90
usage stats and crash reports (#2220)
* draft of usage stats

* fix wrong link

* launcher doesn't need sysinfo dep

* only tokenizer class instead of hole struct

* unused import

* fix clippy errors

* update openAPI doc

* cargo fmt

* fix error in passing flags to router

* try again to update docs

* run pre-commit locally

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* Update router/src/main.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* on crash use anonymous error event

* delete json_output and ngrok

* more robust way of checking if is in container

* more robust nvidia smi

* parse xpu more robustly

* fix errors

* add nvidia-smi details in docs

* cargo fmt

* fix clippy

* should make docs check pass

* Update router/src/usage_stats.rs

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>

* error reason can't be in nested json

* cargo fmt

---------

Co-authored-by: Hugo Larcher <hugo.larcher@huggingface.co>
Co-authored-by: Erik Kaunismäki <erikkaum@Eriks-MacBook-Pro.local>
2024-07-19 16:17:56 +02:00
Nicolas Patry 4c976fb406
Updating the self check (#2209)
* Updating the self check

* Fix.

* Revert the CLI .

* cli.

* Space.

* Revert cargo update.
2024-07-09 17:23:48 +02:00
Nicolas Patry fe710af25f
Adding sanity check to openapi docs. 2024-07-09 11:13:48 +02:00
Wang, Yi 07e240ca37
add doc for intel gpus (#2181)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-07-08 15:57:06 +02:00
Nicolas Patry fb2f74e2b9
Refactor dead code - Removing all `flash_xxx.py` files. (#2166)
* Refactor dead code.

* First working step.

* Remove a lot of duplicated code.

* More dead code.

* More cleanup.

* Fix Santacoder test.

* Fixing the simple tests.

* Fixing sharding.

* Fixes for VLM.

* Fixing santacoder (num_kv_heads hardcoded).

* Removing more dead code.

* Fixing `config.n_head`.

* Stopping earlier because of `<end_of_utterance>` in idefics2.

* Addresses comments.

* Removing the dead code.

* Fuse back mistral into FlashCausalLM.

* Finish removal.

* Fixing docs + causal_lm `batch_class`.

* Fixing docs + causal.lm.

* Add default to Gemma Causality.

* Default value for gemma/gemma2.

* Wrong default.
2024-07-05 10:29:56 +02:00
Nicolas Patry 245d3de948
Preparing patch release. (#2186) 2024-07-04 10:55:33 +02:00
Nicolas Patry 3ea8259af1
Fixing gemma2. (#2135)
* Fixing gemma2.

* Adding new model.
2024-06-27 16:04:20 +02:00
Nicolas Patry b53b21c63a
Bumping to 2.1 (#2131) 2024-06-27 12:34:43 +02:00
drbh 04e1af94d7
Enable multiple LoRa adapters (#2010)
* feat: first draft load multiple lora

* feat: load weights within layer and refactor lora pass

* fix: refactor and reduce lora math

* feat: baseline impl single request multi lora support

* feat: prefer lorax implementation and port loading logic

* fix: prefer adapter_data and refactors

* feat: perfer loraxs custom punica kernels and add mlp loras

* fix: adjust batch for bgmv

* fix: adjust adapter_segments logic when in batch

* fix: refactor and move changes to v3 proto

* fix: pass model_id for all flash causal lms

* fix: pass model_id for all causal and seq2seq lms

* fix: add model_id to model test

* feat: add lora support to mistral and refactors

* feat: prefer model id in request

* fix: include rust code for adapter id

* feat: bump launcher and add new lora docs

* feat: support base model generation and refactors

* fix: rename doc to retry ci build

* feat: support if vlm models

* fix: add adapter_data param and avoid missing layers

* fix: add adapter_data param to phi and neox

* fix: update all models forwards to include adapter_data

* fix: add model_id to IdeficsCausalLM

* Update lora.md

Fixed a typo

* Update lora.md

Fixing spam image

* fix: add lora kernel to dockerfile, support running without kernels and refactors

* fix: avoid dockerfile conflict

* fix: refactors and adjust flash llama lora logic

* fix: skip llama test due to CI issue (temp)

* fix: skip llama test CI (temp) 2

* fix: revert skips and prefer updated ci token for tests

* fix: refactors and helpful comments

* fix: add noop in TensorParallelAdapterRowLinear too

* fix: refactor and move shard_lora_weights logic

* fix: exit early if no adapter_data

---------

Co-authored-by: Derek <datavistics@gmail.com>
2024-06-25 14:46:27 -04:00
KevinDuffy94 1869ee2f57
Add OTLP Service Name Environment Variable (#2076)
* Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069

* Update Docs

* Update README.md

* Update Launcher Docs

* Update Launcher Docs
Removing Option
2024-06-25 09:33:01 +02:00
Lucain 3447c722fd
Support `HF_TOKEN` environment variable (#2066)
* Support HF_TOKEN environement variable

* Load test.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-06-25 09:23:12 +02:00
Alvaro Moran 445f313504
Adding architecture document (#2044)
* doc: adding architecture document

* doc: add architecture to toctree

* fix: avoid cargo lock changes

* fix: avoid cargo lock tweak

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2024-06-14 09:28:34 -04:00
Tiezhen WANG 96b7b40ca3
Update the link for qwen2 (#2068)
* Update the link for qwen2

* Fix Qwen2 model URL in model table

* Fix too eager staging

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-06-14 11:59:33 +02:00
Daniël de Kok 4594e6faba Add support for Marlin-quantized models
This change adds support for Marlin-quantized models. Marlin is an
FP16xINT4 matmul kernel, which provides good speedups decoding batches
of 16-32 tokens. It supports quantized models with symmetric
quantization, groupsize -1 or 128, and 4-bit.

Tested with:

- Llama 2
- Llama 3
- Phi 3
2024-06-06 13:16:52 +02:00
Emmanuel Ferdman fec0167a12
fix: update triton implementation reference (#2002)
# What does this PR do?

<!--
Congratulations! You've made it this far! You're not quite done yet
though.

Once merged, your PR is going to appear in the release notes with the
title you set, so make sure it's a great title that fully reflects the
extent of your awesome contribution.

Then, please replace this with a description of the change and which
issue is fixed (if applicable). Please also include relevant motivation
and context. List any dependencies (if any) that are required for this
change.

Once you're done, someone will review your PR shortly (see the section
"Who can review?" below to tag some potential reviewers). They may
suggest changes to make the code even better. If no one reviewed your PR
after a week has passed, don't hesitate to post a new comment
@-mentioning the same persons---sometimes notifications get lost.
-->

<!-- Remove if not applicable -->

PR #1986 moved the location of the `flash_attn_triton.py` file. This PR
adjusts sources to changes.


## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

Anyone in the community is free to review the PR once the tests have
passed. Feel free to tag
members/contributors who may be interested in your PR.

<!-- Your PR will be replied to more quickly if you can figure out the
right person to tag with @


@OlivierDehaene OR @Narsil

 -->

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2024-06-04 14:26:35 +02:00
Nicholas Broad 08b3eac2ce
single char ` addition for docs (#1989)
# What does this PR do?

I think this will fix the docs from being weirdly formatted. All the
sections after MAX_TOP_N_TOKENS don't show up in the bar on the right
(https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtopntokens)


## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Did you read the [contributor
guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the
[forum](https://discuss.huggingface.co/)? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
Here are the
[documentation
guidelines](https://github.com/huggingface/transformers/tree/main/docs),
and
[here are tips on formatting
docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
- [ ] Did you write any new necessary tests?


## Who can review?

@merveenoyan

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2024-05-31 18:42:14 +02:00
fxmarty 659bd67fec
Update documentation version to 2.0.4 (#1980)
As per title

cc @Narsil
2024-05-31 16:03:24 +02:00
Daniël de Kok 36dd16017c Add support for exl2 quantization
Mostly straightforward, changes to existing code:

* Wrap quantizer parameters in a small wrapper to avoid passing
  around untyped tuples and needing to repack them as a dict.
* Move scratch space computation to warmup, because we need the
  maximum input sequence length to avoid allocating huge
  scratch buffers that OOM.
2024-05-30 11:28:05 +02:00