hf_text-generation-inference/nix/impure-shell.nix

{
  lib,
  mkShell,
  black,
  cmake,
  isort,
  ninja,
  which,
  cudaPackages,
  openssl,
  pkg-config,
  protobuf,
  python3,
  pyright,
  redocly,
  ruff,
  rust-bin,
  server,

  # Enable dependencies for building CUDA packages. Useful for e.g.
  # developing marlin/moe-kernels in-place.
  withCuda ? false,
}:

mkShell {
  nativeBuildInputs =
    [
      black
      isort
      pkg-config
      (rust-bin.stable.latest.default.override {
        extensions = [
          "rust-analyzer"
          "rust-src"
        ];
      })
      protobuf
      pyright
      redocly
      ruff
    ]
    ++ (lib.optionals withCuda [
      cmake
      ninja
      which

      # For most Torch-based extensions, setting CUDA_HOME is enough, but
      # some custom CMake builds (e.g. vLLM) also need to have nvcc in PATH.
      cudaPackages.cuda_nvcc
    ]);
  buildInputs =
    [
      openssl.dev
    ]
    ++ (with python3.pkgs; [
      venvShellHook
      docker
      pip
      ipdb
      click
      pytest
      pytest-asyncio
      syrupy
    ])
    ++ (lib.optionals withCuda (
      with cudaPackages;
      [
        cuda_cccl
        cuda_cudart
        cuda_nvrtc
        cuda_nvtx
        cuda_profiler_api
        cudnn
        libcublas
        libcusolver
        libcusparse
      ]
    ));

  inputsFrom = [ server ];

  env = lib.optionalAttrs withCuda {
    CUDA_HOME = "${lib.getDev cudaPackages.cuda_nvcc}";
    TORCH_CUDA_ARCH_LIST = lib.concatStringsSep ";" python3.pkgs.torch.cudaCapabilities;
  };

  venvDir = "./.venv";

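  # Install the server and the Python client in editable mode. Dependencies
  # are skipped here (--no-dependencies) rather than resolved by pip.
  postVenvCreation = ''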
    unset SOURCE_DATE_EPOCH
    ( cd server ; python -m pip install --no-dependencies -e . )
    ( cd clients/python ; python -m pip install --no-dependencies -e . )
  '';

  postShellHook = ''
    unset SOURCE_DATE_EPOCH
    export PATH=$PATH:~/.cargo/bin
  '';
}