* feat: process token stream before returning to client
* fix: expect content in test
* fix: improve comparison via ruff lint
* fix: return event in all cases
* fix: always send event on error, avoid unwraps, refactor and improve tests
* fix: prefer no_tool over notify_error to improve reponse
* fix: adjust chat input test for no_tool
* fix: adjust test expected content
---------
Co-authored-by: System administrator <root@ip-10-90-0-186.ec2.internal>
To make sure that everything is formatted with the same black version
as CI.
I sometimes use isort for new files to get nicely ordered imports,
so add it as well. Also set the isort configuration to format in a
way that is compatible with black.
* Add basic FP8 KV cache support
This change adds rudimentary FP8 KV cache support. The support is
enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
uses this type for the KV cache. However support is still limited:
* Only the `fp8_e5m2` type is supported.
* The KV cache layout is the same as `float16`/`bfloat16` (HND).
* The FP8 KV cache is only supported for FlashInfer.
* Loading of scales is not yet supported.
* Fix Cargo.toml
* feat: unroll notify_error if no tool is choosen
* fix: expect simple message when no tool is selected
* fix: improve test to avoid notify_error
* fix: improve docs and indicate change in expected response
* fix: adjust linting in test file
* adding max_token_capacity_metric
* added tgi to name of metric
* Adding max capacity metric.
* Add description for the metrics
---------
Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
* Working loading state.
* Preprocessing.
* Working state ? (Broke idefics1 temporarily).
* Cleaner condition.
* Fix idefics.
* Updating config, removing TODO
* Mllama
* Ugrade transformers 4.45
* Flashing mllama.
* Starting to get there.
* Working state.
* Integrations tests for mllama (cutting to 10 tokens because there seems'
to be instability after (meaning size of the batch matters.
* Updating model link.
* Earlier assert.
* Fix vlm ?
* remove log.
* Force ignore all images but last.
* Default dtype bfloat16.
* Update integration test after switch to bf16.
* Remove dead code.
* Removed dead code.
* Upgrade the flake to latest transformers/tokenizers
* Move to hf tgi-nix
* Upgrade to 0.5.0
* nix: experimental support for building a Docker image
Run using something like:
```
docker run \
--device nvidia.com/gpu=all \
-it --rm -p 8080:80 \
-v $PWD/data:/data \
-v $PWD/tmp:/tmp \
tgi-docker:latest \
--model-id <model_id>
```
* Example of building the Docker image using Nix inside Docker
* Stream to make the builder image smaller
This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.
* Don't spam journalctl on Linux
* Other dockerfile.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
* feat: support phi3.5 moe model loading
* fix: prefer llama base model and improve rotary logic
* feat: return reasonable generation and add integration test
* fix: run lint and update docs
* fix: rerun lint for openapi docs
* fix: prefer do_sample false unless temp is set by user, and update chat tests
* fix: small typo adjustments
* fix: consolidate long rope paths
* fix: revert greedy by default and test changes
* Vendor configuration so that we don't have to `trust_remote_code`
* Use SparseMoELayer
* Add support for dense MoE
* Some type annotations
* Add the usual model tests
* Ruff.
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
This change add support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:
- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
Remove compute capability lock
We are only calling the `get_cuda_capability` function once, so avoiding
the cost of multiple calls is not really necessary yet.
* Improve support for GPUs with capability < 8
- For models that cannot use flashinfer, use flash-attn v1 + paged
attention for models with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
* Add support for scalar FP8 weight scales
* Support LLM compressor FP8 checkpoints on H100
On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype.
However, we wouldn't pick up fp8 quantization for models quantized with
LLM compressor. This change adds enough parsing to detect if models have
FP8-quantized weights.
* Remove stray debug print
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow