Improving mamba runtime by using updates (#1552)
- Move float16 to bfloat16, which is less prone to precision issues (load tests
fail with the update kernels + f16, but all pass under bf16).
Another note: we are not respecting the f32 layer norm defined in the
configuration (this is OK in my book, but it could affect f16 precision).
- Moved to the update kernels. The Triton launch overhead is very high; it is removed
by switching to CUDA graphs, which works great (an update CUDA graph is available
in TRT-LLM if needed and looks *exactly* like the regular SSM kernel). See the
capture/replay sketch below.
- Restructured the inference_params struct so it holds only 2 tensors, to
reduce the overhead of copying back and forth for the CUDA graphs.
- The leftover overhead seems to be entirely in the tokenization step (4
copies are still paid before launching the graph).
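A minimal sketch of the CUDA graph capture/replay pattern referenced above, not the actual TGI implementation: `decode_step`, the state shapes, and the static buffer names are placeholders, and a CUDA GPU is assumed.

```python
import torch

# Hypothetical single-token decode step; in TGI this would run the mamba
# update kernels on the static inference_params tensors.
def decode_step(input_ids, conv_state, ssm_state):
    return input_ids + 1  # placeholder for the real SSM update

# CUDA graphs require stable memory addresses, so inputs live in static
# buffers that are copied into before every replay.
static_input = torch.zeros(1, 1, dtype=torch.int64, device="cuda")
conv_state = torch.zeros(1, 16, 4, dtype=torch.bfloat16, device="cuda")
ssm_state = torch.zeros(1, 16, 16, dtype=torch.bfloat16, device="cuda")

# Warm up on a side stream, then capture a single decode step into a graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    decode_step(static_input, conv_state, ssm_state)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = decode_step(static_input, conv_state, ssm_state)

# Replay: copy the new token into the static buffer, then launch the whole
# captured kernel sequence with a single CPU-side call, skipping per-kernel
# (Triton) launch overhead.
static_input.copy_(torch.tensor([[42]], device="cuda"))
graph.replay()
print(static_output)
```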
import torch
import os
from loguru import logger
from typing import Dict, Optional

from text_generation_server.utils.log import log_master

# Any non-empty value of USE_PREFIX_CACHING enables prefix caching.
PREFIX_CACHING = os.getenv("USE_PREFIX_CACHING", False)
log_master(logger.info, f"Using prefix caching = {PREFIX_CACHING}")

ATTENTION = os.getenv("ATTENTION", "flashinfer" if PREFIX_CACHING else "paged")
_expected = {"paged", "flashdecoding", "flashinfer"}
assert (
    ATTENTION in _expected
), f"Attention is not valid {ATTENTION}, expected {_expected}"
log_master(logger.info, f"Using Attention = {ATTENTION}")

if PREFIX_CACHING and ATTENTION != "flashinfer":
    raise RuntimeError("Prefix caching is only supported with flashinfer")

MEM_POOL = torch.cuda.graph_pool_handle() if torch.cuda.is_available() else None

# This is overridden by the cli
BLOCK_SIZE: int
if ATTENTION == "flashdecoding":
    BLOCK_SIZE = 256
elif ATTENTION == "flashinfer":
    BLOCK_SIZE = 1
else:
    BLOCK_SIZE = 16

cuda_graphs = os.getenv("CUDA_GRAPHS")
if cuda_graphs is not None:
    try:
        cuda_graphs = [int(item) for item in cuda_graphs.split(",")]
    except Exception as e:
        raise RuntimeError(
            f"Could not parse cuda graphs {cuda_graphs}, expected comma separated list for batch sizes to run on: {e}"
        )
else:
    cuda_graphs = None

# sorting the cuda graphs in descending order helps reduce the
# memory impact and results in less memory usage
if cuda_graphs is not None:
    cuda_graphs.sort(reverse=True)

CUDA_GRAPHS = cuda_graphs
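For reference, a quick illustration of the parsing above; the `CUDA_GRAPHS` value is a hypothetical example, not a default.

```python
import os

# Hypothetical value; CUDA_GRAPHS is normally set by the launcher/CLI.
os.environ["CUDA_GRAPHS"] = "1,2,4,8"

raw = os.getenv("CUDA_GRAPHS")
batch_sizes = [int(item) for item in raw.split(",")]
batch_sizes.sort(reverse=True)  # largest batch size captured first
assert batch_sizes == [8, 4, 2, 1]
```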
MI300 compatibility (#1764)
Adds support for AMD Instinct MI300 in TGI.
Most changes are:
* Support PyTorch TunableOp to pick the GEMM/GEMV kernels for decoding
https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cuda/tunable.
TunableOp is disabled by default and can be enabled with
`PYTORCH_TUNABLEOP_ENABLED=1` (see the sketch after this commit message).
* Update the ROCm dockerfile to PyTorch 2.3 (actually patched with changes
from https://github.com/pytorch/pytorch/pull/124362)
* Support SiLU & Linear custom kernels contributed by AMD
* Update vLLM paged attention to https://github.com/fxmarty/rocm-vllm/,
branching off a much more recent commit
https://github.com/ROCm/vllm/commit/3489ce7936c5de588916ae3047c44c23c0b0c308
* Support the FA2 Triton kernel as recommended by AMD. It can be enabled by
specifying `ROCM_USE_FLASH_ATTN_V2_TRITON=1`.
* Update the dockerfile to ROCm 6.1
By default, TunableOp tuning results are saved in `/data` (e.g.
`/data/tunableop_meta-llama-Llama-2-70b-chat-hf_tp1_rank0.csv`) so the tuning
does not have to be rerun at each `docker run`.
Example:
```
Validator,PT_VERSION,2.3.0
Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
Validator,HIPBLASLT_VERSION,0.7.0-1549b021
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
GemmTunableOp_Half_TN,tn_8192_7_28672,Gemm_Rocblas_45475,0.132098
GemmTunableOp_Half_TN,tn_10240_4_8192,Gemm_Rocblas_45546,0.0484431
GemmTunableOp_Half_TN,tn_32000_6_8192,Default,0.149546
GemmTunableOp_Half_TN,tn_32000_3_8192,Gemm_Rocblas_45520,0.147119
GemmTunableOp_Half_TN,tn_8192_3_28672,Gemm_Rocblas_45475,0.132645
GemmTunableOp_Half_TN,tn_10240_3_8192,Gemm_Rocblas_45546,0.0482971
GemmTunableOp_Half_TN,tn_57344_5_8192,Gemm_Rocblas_45520,0.255694
GemmTunableOp_Half_TN,tn_10240_7_8192,Gemm_Rocblas_45517,0.0482522
GemmTunableOp_Half_TN,tn_8192_3_8192,Gemm_Rocblas_45546,0.0444671
GemmTunableOp_Half_TN,tn_8192_5_8192,Gemm_Rocblas_45546,0.0445834
GemmTunableOp_Half_TN,tn_57344_7_8192,Gemm_Rocblas_45520,0.25622
GemmTunableOp_Half_TN,tn_8192_2_28672,Gemm_Rocblas_45475,0.132122
GemmTunableOp_Half_TN,tn_8192_4_8192,Gemm_Rocblas_45517,0.0453191
GemmTunableOp_Half_TN,tn_10240_5_8192,Gemm_Rocblas_45517,0.0482514
GemmTunableOp_Half_TN,tn_8192_5_28672,Gemm_Rocblas_45542,0.133914
GemmTunableOp_Half_TN,tn_8192_2_8192,Gemm_Rocblas_45517,0.0446516
GemmTunableOp_Half_TN,tn_8192_1_28672,Gemm_Hipblaslt_TN_10814,0.131953
GemmTunableOp_Half_TN,tn_10240_2_8192,Gemm_Rocblas_45546,0.0481043
GemmTunableOp_Half_TN,tn_32000_4_8192,Gemm_Rocblas_45520,0.147497
GemmTunableOp_Half_TN,tn_8192_6_28672,Gemm_Rocblas_45529,0.134895
GemmTunableOp_Half_TN,tn_57344_2_8192,Gemm_Rocblas_45520,0.254716
GemmTunableOp_Half_TN,tn_57344_4_8192,Gemm_Rocblas_45520,0.255731
GemmTunableOp_Half_TN,tn_10240_6_8192,Gemm_Rocblas_45517,0.0484816
GemmTunableOp_Half_TN,tn_57344_3_8192,Gemm_Rocblas_45520,0.254701
GemmTunableOp_Half_TN,tn_8192_4_28672,Gemm_Rocblas_45475,0.132159
GemmTunableOp_Half_TN,tn_32000_2_8192,Default,0.147524
GemmTunableOp_Half_TN,tn_32000_5_8192,Default,0.147074
GemmTunableOp_Half_TN,tn_8192_6_8192,Gemm_Rocblas_45546,0.0454045
GemmTunableOp_Half_TN,tn_57344_6_8192,Gemm_Rocblas_45520,0.255582
GemmTunableOp_Half_TN,tn_32000_7_8192,Default,0.146705
GemmTunableOp_Half_TN,tn_8192_7_8192,Gemm_Rocblas_45546,0.0445489
```
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
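A hedged sketch of wiring up the TunableOp settings mentioned above. `PYTORCH_TUNABLEOP_ENABLED=1` is the switch named in this PR; `PYTORCH_TUNABLEOP_FILENAME` is assumed from the PyTorch TunableOp documentation, and the `/data` path and GEMM shapes are illustrative only.

```python
import os

# Enable TunableOp and (assumption) point the results file at the /data cache
# described above, before torch is imported so the settings are picked up.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "/data/tunableop_results.csv"

import torch

# A decode-shaped GEMM (skinny activation x large weight); with TunableOp
# enabled, the first run triggers tuning and the selected kernel is cached.
if torch.cuda.is_available():
    w = torch.randn(8192, 28672, dtype=torch.float16, device="cuda")
    x = torch.randn(7, 8192, dtype=torch.float16, device="cuda")
    y = x @ w
    torch.cuda.synchronize()
```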

# NOTE: eventually we should move this into the router and pass back the
# index in all cases.
ADAPTER_TO_INDEX: Optional[Dict[str, int]] = None


def set_adapter_to_index(adapter_to_index: Dict[str, int]):
    global ADAPTER_TO_INDEX
    ADAPTER_TO_INDEX = adapter_to_index


def get_adapter_to_index():
    global ADAPTER_TO_INDEX
    return ADAPTER_TO_INDEX
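A minimal usage sketch for the adapter-index registry above. The module path is assumed to be `text_generation_server.models.globals`, and the adapter names and indices are made up for illustration.

```python
# Assumed module path and illustrative adapter ids.
from text_generation_server.models.globals import (
    get_adapter_to_index,
    set_adapter_to_index,
)

# Startup path: register the mapping from adapter id to its index in the
# stacked adapter weights.
set_adapter_to_index({"sql-lora": 0, "chat-lora": 1})

# Request path: resolve an adapter id back to its index (0 = base model here).
adapter_to_index = get_adapter_to_index()
index = adapter_to_index.get("sql-lora", 0) if adapter_to_index else 0
assert index == 0
```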