
Text Generation Inference

This is Preemo's fork of text-generation-inference, originally developed by Hugging Face. The original README is at […]. Since Hugging Face's text-generation-inference is no longer open-source, we have forked it and will continue to develop it here.

Our goal is to create an open-source text generation inference server that is modular enough to make it easy to add state-of-the-art models, functionalities, and optimizations. Functionalities and optimizations should be composable, so that users can combine them into a custom inference server that fits their needs.

Our plan

We at Preemo are currently busy with the first release of our other product, so we expect to start open-source development on this repository in September 2023. To make external contributions easier, we will be working on the following:

  • Adding a publicly visible CI/CD pipeline that runs tests and builds Docker images
  • Unifying the build tools
  • Modularizing the codebase, introducing a plugin system

Our long-term goal is to grow the community around this repository, as a playground for trying out new ideas and optimizations in LLM inference. We at Preemo will implement features that interest us, but we also welcome contributions from the community, as long as they are modularized and composable.

Extra features in comparison to Hugging Face text-generation-inference v0.9.4

4-bit quantization

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. Enable it by passing either --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 as a command-line argument to text-generation-launcher.
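
As a minimal sketch of how the flag might be passed, the Python snippet below assembles a launcher command line with one of the two 4-bit modes. The helper name build_launcher_command and the model name my-org/my-model are hypothetical placeholders, and the snippet assumes the launcher still accepts the upstream --model-id argument. It only builds and prints the command; it does not start the server.

```python
import shlex

# The two 4-bit modes named in this README.
FOUR_BIT_MODES = ("bitsandbytes-nf4", "bitsandbytes-fp4")


def build_launcher_command(model_id, quantize):
    """Return the argument list for launching with 4-bit quantization.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    if quantize not in FOUR_BIT_MODES:
        raise ValueError(
            f"unsupported 4-bit mode {quantize!r}; expected one of {FOUR_BIT_MODES}"
        )
    # --model-id is assumed to behave as in upstream text-generation-launcher.
    return [
        "text-generation-launcher",
        "--model-id", model_id,
        "--quantize", quantize,
    ]


# "my-org/my-model" is a placeholder, not a real model.
command = build_launcher_command("my-org/my-model", "bitsandbytes-nf4")
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command) on a machine where text-generation-launcher is installed. Swapping the second argument for bitsandbytes-fp4 selects the FP4 data type instead.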