- sections:
  - local: index
    title: Text Generation Inference
  - local: quicktour
    title: Quick Tour

MI300 compatibility (#1764)

Adds support for AMD Instinct MI300 in TGI. The main changes are:

* Support PyTorch TunableOp to pick the GEMM/GEMV kernels used for decoding
  (https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable).
  TunableOp is disabled by default and can be enabled with
  `PYTORCH_TUNABLEOP_ENABLED=1` (see the sketch just after this list).
* Update the ROCm Dockerfile to PyTorch 2.3 (patched with the changes from
  https://github.com/pytorch/pytorch/pull/124362).
* Support SiLU & Linear custom kernels contributed by AMD.
* Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/,
  branched from a much more recent commit
  (https://github.com/ROCm/vllm/commit/3489ce7936c5de588916ae3047c44c23c0b0c308).
* Support the FA2 Triton kernel, as recommended by AMD. It can be enabled by
  setting `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.
* Update the Dockerfile to ROCm 6.1.
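
Outside of TGI, the same TunableOp mechanism can be exercised directly from
PyTorch. Below is a minimal sketch (illustration only, not TGI code), assuming
a ROCm or CUDA build of PyTorch with TunableOp support; the environment
variables are TunableOp's documented interface, and the output path is just an
illustrative choice:

```
# Minimal TunableOp sketch (illustration only, not TGI code).
import os

# TunableOp reads these at first use, so set them before any matmul runs
# (setting them before `import torch` is the safest option).
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"   # TunableOp is off by default
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"    # benchmark shapes with no recorded result
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "/data/tunableop_results.csv"  # illustrative path

import torch

# A decode-like GEMM: a large weight matrix applied to a small batch.
w = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
x = torch.randn(8192, 4, device="cuda", dtype=torch.float16)

# The first call for a new shape benchmarks the candidate kernels and records
# the winner in the CSV; later calls with the same shape reuse that kernel.
y = w @ x
torch.cuda.synchronize()
```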
By default, TunableOp tuning results are saved in `/data` (e.g.
`/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) so that the
tuning does not have to be rerun at each `docker run`. Example:
```
Validator,PT_VERSION,2.3.0
Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
Validator,HIPBLASLT_VERSION,0.7.0-1549b021
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
```
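
In this file, the `Validator` rows fingerprint the environment the results were
tuned for, and each `GemmTunableOp_*` row maps a GEMM shape key to the selected
kernel and the timing TunableOp recorded for it (rows listing `Default` kept
the default implementation). A hypothetical helper, not part of TGI, that
summarizes such a file:

```
# Hypothetical helper (not part of TGI): list which kernel TunableOp selected
# for each GEMM shape in a results CSV like the one above.
import csv

def summarize_tunableop(path: str) -> None:
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 4 or row[0] == "Validator":
                continue  # skip the environment-fingerprint rows
            op, shape_key, solution, timing = row[0], row[1], row[2], float(row[3])
            note = " (default implementation)" if solution == "Default" else ""
            print(f"{op} {shape_key}: {solution} ({timing:.4f}){note}")

# For example, the file from the Llama-2-70B deployment shown above:
summarize_tunableop("/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv")
```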
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>

  - local: installation_nvidia
    title: Using TGI with Nvidia GPUs
  - local: installation_amd
    title: Using TGI with AMD GPUs
  - local: installation_gaudi
    title: Using TGI with Intel Gaudi
  - local: installation_inferentia
    title: Using TGI with AWS Inferentia
  - local: installation_intel
    title: Using TGI with Intel GPUs
  - local: installation
    title: Installation from source
  - local: supported_models
    title: Supported Models and Hardware
  - local: messages_api
    title: Messages API
  - local: architecture
    title: Internal Architecture
  - local: usage_statistics
    title: Usage Statistics
  title: Getting started
- sections:
  - local: basic_tutorials/consuming_tgi
    title: Consuming TGI
  - local: basic_tutorials/preparing_model
    title: Preparing Model for Serving
  - local: basic_tutorials/gated_model_access
    title: Serving Private & Gated Models
  - local: basic_tutorials/using_cli
    title: Using TGI CLI
  - local: basic_tutorials/launcher
    title: All TGI CLI options
  - local: basic_tutorials/non_core_models
    title: Non-core Model Serving
  - local: basic_tutorials/safety
    title: Safety
  - local: basic_tutorials/using_guidance
    title: Using Guidance, JSON, tools
  - local: basic_tutorials/visual_language_models
    title: Visual Language Models
  - local: basic_tutorials/monitoring
    title: Monitoring TGI with Prometheus and Grafana
  - local: basic_tutorials/train_medusa
    title: Train Medusa
  title: Tutorials
- sections:
  - local: conceptual/streaming
    title: Streaming
  - local: conceptual/quantization
    title: Quantization
  - local: conceptual/tensor_parallelism
    title: Tensor Parallelism
  - local: conceptual/paged_attention
    title: PagedAttention
  - local: conceptual/safetensors
    title: Safetensors
  - local: conceptual/flash_attention
    title: Flash Attention
  - local: conceptual/speculation
    title: Speculation (Medusa, ngram)
  - local: conceptual/guidance
    title: How Guidance Works (via outlines)
  - local: conceptual/lora
    title: LoRA (Low-Rank Adaptation)
  title: Conceptual Guides