Commit Graph

1341 Commits

Author SHA1 Message Date
Nicolas Patry e497bc09f6
Minor fixes. (#3125) 2025-03-18 15:42:35 +01:00
Nicolas Patry 67ce543e04
Intel docker. (#3121)
* Intel docker.

* torchaudio ?

* Fixing dockerfile ?
2025-03-18 15:12:11 +01:00
Nicolas Patry 83fe45c15e
Prepare for patch release. (#3124) 2025-03-18 15:11:55 +01:00
Nicolas Patry 11f2eec10e
Publish nix docker image. (#3122)
* Publish nix docker image.

* Run during PR.

* Something else.

* Forgot to push.

* Build zstd.

* Pushing with skopeo

* Testing the PR.

* Runnign from nix.

* Cleaner tags.
2025-03-18 12:58:21 +01:00
Mohit Sharma a35fbdb925
Bug Fix: Sliding Window Attention (#3112)
* (fix) sliding window attention

* (fix) flashinfer

* (typo) collection link

* Add window_size_left param ipex rocm

* Update window size rocm flash decoding

* fix: bump snapshots and improve exceed window test case

* feat: add tests for image types and remove alpha from png

* Upgrading `from_env` to get token from file when necessary + fix
pali_gemma.

* fix: add pillow dependency and bump lock+requirements

* fix: bump org name in gemma3 test

* Fix qwen2.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-18 10:37:33 +01:00
Baptiste Colle 8c2c348f3c
Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork (#3117)
feat(gaudi): add all the changes from tgi-gaudi fork up to PR #289
2025-03-18 09:45:52 +01:00
Daniël de Kok 095775e05c
launcher: correctly get the head dimension for VLMs (#3116)
* launcher: correctly get the head dimension for VLMs

For most (?) VLMs, the head dimension is in the `text_config`
configuration section. However, since we only queried the top-level
`head_dim` (which typically doesn't exist in VLMs), we would never use
flashinfer. This change adds a method that gets the head dimension from
the top-level `Config` struct or `text_config` when that fails.

* fix: bump org name in gemma3 test

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
2025-03-17 18:19:37 +01:00
Wang, Yi 0b3e3db043
xpu 2.6 update (#3051)
* xpu 2.6 update

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* install whl

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update get xpu memory api

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* int

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix awq crash if modules_to_not_convert is None

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-17 13:48:48 +01:00
Daniël de Kok f91434e99b
Make the Nix-based Docker container work on non-NixOS (#3109)
On NixOS, the CUDA driver shim gets mounted on /run/opengl-driver,
where Nix packages expect the shim to be. However, on other
distributions, some FHS paths are mounted. This is a small change
to make the dynamic loader find the shim.
2025-03-13 14:02:45 +01:00
Nicolas Patry 8b91f92978
Fixing the docker build. (#3108)
* Fixing the docker build.

* Apply suggestions from code review
2025-03-13 11:26:44 +01:00
Baptiste Colle 27ed848676
Release of Gaudi Backend for TGI (#3091)
* feat(gaudi): release ready (docs, docker image and vlm ready)

* fix(gaudi): add default argument for the dockerfile

* fix(gaudi): remove use of latest for gaudi docker image + redid gaudi benchmarking section to include best practices
2025-03-13 10:56:01 +01:00
Nicolas Patry 83ef364177
We need gcc during runtime to enable triton to compile kernels. (#3103)
* We need gcc during runtime to enable triton to compile kernels.

* Fixing the docker build.
2025-03-13 10:45:47 +01:00
Daniël de Kok 83b7b7bb92
Router: add `gemma3-text` model type (#3107) 2025-03-13 10:41:33 +01:00
Daniël de Kok c73ae0bd88
Update to `kernels` 0.2.1 (#3084)
* Update to `kernels` 0.2.1

The package was renamed from `hf-kernels` to `kernels`. The new version
also updates the lockfile format.

* Download kernels in `install-cuda` target
2025-03-13 10:36:29 +01:00
Nicolas Patry d4c6faa67b
Try to fix on main CI color. (#3101) 2025-03-12 10:12:24 +01:00
Nicolas Patry 4ac06ddf56
Preparing relase 3.2.0 (#3100)
* Preparing relase 3.2.0

* Forgot the README.

* Update doc.
2025-03-12 10:11:33 +01:00
David Corvoysier f01dc9e743
Update neuron backend (#3098)
* feat(neuron): use AWS Neuron SDK 2.21.1

* feat(neuron): bump optimum-neuron version

* feat(neuron): tag latest image for local tests

* test(neuron): simplify sampling test
2025-03-12 09:53:15 +01:00
Nicolas Patry 5c5528e362
Fix tool call4 (#3094)
* Removing the no_tool content information.

* Removing a lot of NO_TOOL shenanigans.

* Update the tests.
2025-03-12 09:28:47 +01:00
Mohit Sharma ed46c2c414
Add gemma3 model (#3099) 2025-03-12 09:25:51 +01:00
Nicolas Patry f74c36fe0d
Fix tool call3 (#3086)
* Fixing the tool calling convention.

* Update tehe doc.

* Fixing some corner cases.

* Fixing the tool call id.

* Fmt.

* Snapshot update with the new updated tool_call_id.

* More qwen2.
2025-03-12 09:22:53 +01:00
celsowm ae4451c3da
Update README.md (#3095)
space between param and value
2025-03-11 11:05:21 +01:00
Nicolas Patry b447f7e821
Fix qwen vl (#3096)
* Fixing qwen2.5 VL.

* Fixing the CI.
2025-03-11 11:00:41 +01:00
Adrien Gallouët 094975c3a8
Update the llamacpp backend (#3022)
* Build faster

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Make --model-gguf optional

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Enable mmap, offload_kqv & flash_attention by default

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Better error message

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update doc

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update installed packages

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Save gguf in models/MODEL_ID/model.gguf

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Fix build with Mach-O

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Quantize without llama-quantize

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp and switch to ggml-org

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Remove make-gguf.sh

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Update Cargo.lock

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Support HF_HUB_USER_AGENT_ORIGIN

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Bump llama.cpp

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Add --build-arg llamacpp_native & llamacpp_cpu_arm_arch

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-03-11 09:19:01 +01:00
drbh dc5f05f8e6
Pr 3003 ci branch (#3007)
* change ChatCompletionChunk to align with "OpenAI Chat Completions streaming API"

Moving after tool_calls2

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

add in Buffering..

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

fix: handle usage outside of stream state and add tests

Simplifying everything quite a bit.

Remove the unused model_dump.

Clippy.

Clippy ?

Ruff.

Uppgrade the flake for latest transformers.

Upgrade after rebase.

Remove potential footgun.

Fix completion test.

* Clippy.

* Tweak for multi prompt.

* Ruff.

* Update the snapshot a bit.

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
2025-03-10 17:56:19 +01:00
Daniël de Kok 124398fa57
hotfix: qwen2 formatting (#3093)
* hotfix: qwen2 formatting

* cargo fmt
2025-03-10 16:19:50 +01:00
Daniël de Kok c5ecc7a4de
Small test and typing fixes (#3078)
* test_weights: add modules_to_not_convert

* More typing fixes
2025-03-10 15:08:23 +01:00
jiqing-feng cae0cbe87d
Add modules_to_not_convert in quantized model (#3053)
* fix modules_to_not_convert

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix tp quant skip

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* revert unquantized changes

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* use DefaultWeightsLoader in skip modules

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
2025-03-10 15:03:51 +01:00
EachSheep bbe218a4f7
Add qwen2 multi lora layers support (#3089)
add qwen2 multi lora layers support to solve problem like https://github.com/huggingface/text-generation-inference/issues/2881, the similar PR are at https://github.com/huggingface/text-generation-inference/pull/2883

Co-authored-by: hjs <hjs@pku.edu.cn>
2025-03-10 12:42:59 +01:00
Alex Weston 58a65f7914
Add request parameters to OTel span for `/v1/chat/completions` endpoint (#3000)
Record request parameters in OTel span for /v1/chat/completions endpoint
2025-03-10 12:26:57 +01:00
Daniël de Kok 976eae216f
Nix: the launcher needs a Python env with Torch for GPU detection (#3085)
This makes `nix run .` in the repository work again. Should fix #3025.
2025-03-10 12:11:10 +01:00
Nicolas Patry 622908deab
Fix tool call2 (#3076)
* Making `tool_calls` a vector.

* Arguments output is a string.

* Update all the integration tests.

* Add the requirements.

* Upgrade other tests.

* Clippy.

* Update the old test.
2025-03-07 19:45:57 +01:00
Alvaro Bartolome 55a6618434
Update `--max-batch-total-tokens` description (#3083)
* Update `--max-batch-total-tokens` description

* Update docstring in `launcher/src/main.rs` instead
2025-03-07 14:24:26 +01:00
Daniël de Kok 036d802b62
Nix: add `openai` to impure shell for integration tests (#3081) 2025-03-07 13:04:21 +01:00
Nicolas Patry 8e92942a18
Making `tool_calls` a vector. (#3075)
* Making `tool_calls` a vector.

* Update doc.

* Fixing the nix overlay with updated version.

* Add openai dependency.

* Updating the old tests.

* Trying to reduce the logs in the case of errors.

* Less spammy logs too.
2025-03-05 22:32:31 +01:00
Nicolas Patry 3208d1cd1d
Revert "Trying to reduce the logs in the case of errors."
This reverts commit cdf70d6a28.
2025-03-05 20:52:38 +01:00
Nicolas Patry cdf70d6a28
Trying to reduce the logs in the case of errors. 2025-03-05 20:50:43 +01:00
Nicolas Patry ab9dafc68f
Making sure Olmo (transformers backend) works. (#3074) 2025-03-05 17:46:47 +01:00
Nicolas Patry 31766dad77
Force upgrade transformers version for olmo. 2025-03-05 12:17:09 +01:00
Nicolas Patry ec35976f82
Only add token when it is defined. (#3073)
* Only add token when it is defined.

* Update router/src/server.rs
2025-03-05 11:59:52 +01:00
David Corvoysier cb42b3ad83
fix(neuron): explicitly install toolchain (#3072)
* fix(neuron): explicitly install toolchain

* ci(neuron): trigger CI when Dockerfile is modified
2025-03-05 11:46:58 +01:00
Nicolas Patry 491ed9e11d
Patch rust release. (#3069)
* Patch rust release.

* Trying to remove the rust-toolchain hardcoded in action.

* Upgrade rust toolchain.

* Put back the toolchain ?

* Fix neuron dockerfile.

* Move to the proper version of Rust.

* 1.85 since the GH action doesn't respect the override.

* Typo.

* Fixing the github action.

* Fixing docker llamacpp.

* Fixing the github action.

* Update clippy.
2025-03-04 18:07:33 +01:00
Sadra Barikbin 144d99c147
Fix a tiny typo in `monitoring.md` tutorial (#3056)
Update monitoring.md
2025-03-04 17:06:26 +01:00
Nicolas Patry 08bbfa16a1
Preparing for release. (#3060)
* Preparing for release.

* Upgrade doc.

* Fix docs auto-generated.

* Fix update doc along.
2025-03-04 16:47:10 +01:00
Hugo Larcher d8ff7f2623
feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. (#3061)
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests.

* fix: Rust version for Neuron

* fix: PR comments, use rust-toolchain.toml
2025-03-04 16:43:50 +01:00
Daniël de Kok e88f6f6ee9
Add property-based testing for `RadixAllocator` (#3068) 2025-03-04 15:09:46 +01:00
Daniël de Kok fa4e9511f8
Fix two edge cases in `RadixTrie::find` (#3067)
- Always return a node, not its parent.
- Do not recurse when a node does not represent a full prefix of the
  input.
2025-03-04 13:23:27 +01:00
Nicolas Patry a914a21899
Revert "Patch rust release."
This reverts commit aad9c2b0bd.
2025-03-04 12:16:18 +00:00
Nicolas Patry aad9c2b0bd
Patch rust release. 2025-03-04 12:14:58 +00:00
Nicolas Patry 1f35cc7a31
Updating patch rust release. 2025-03-04 12:13:58 +00:00
Baptiste Colle 683ff53fa3
Add Gaudi Backend (#3055)
* wip(gaudi): import server and dockerfile from tgi-gaudi fork

* feat(gaudi): new gaudi backend working

* fix: fix style

* fix prehooks issues

* fix(gaudi): refactor server and implement requested changes
2025-02-28 12:14:58 +01:00