Yang, Bo d2ae3581bf Claim copyright (#7)		2023-08-02 17:23:54 -07:00
.github	chore: migrate ci region for more availability. (#581)	2023-07-12 10:01:01 +02:00
assets	feat(benchmark): tui based benchmarking tool (#149)	2023-03-30 15:26:27 +02:00
benchmark	docs(benchmarker): Adding some help for the options in `text-generation-benchmark`. (#462)	2023-07-04 18:35:37 +02:00
clients/python	feat(server): only compute prefill logprobs when asked (#406)	2023-06-02 17:12:30 +02:00
docs	v0.9.4 (#713)	2023-07-27 19:25:15 +02:00
integration-tests	feat: add cuda memory fraction (#659)	2023-07-24 11:43:58 +02:00
launcher	feat: add cuda memory fraction (#659)	2023-07-24 11:43:58 +02:00
load_tests	feat: add nightly load testing (#358)	2023-05-23 17:42:19 +02:00
proto	feat(server): auto max_batch_total_tokens for flash att models (#630)	2023-07-19 09:31:25 +02:00
router	feat(server): update vllm version (#723)	2023-07-28 15:36:38 +02:00
server	Don't enable custom kernels if CUDA is not available (#6)	2023-08-02 09:51:54 -07:00
.dockerignore	chore: add `flash-attention` to docker ignore (#287)	2023-05-05 17:52:09 +02:00
.gitignore	feat(server): Rework model loading (#344)	2023-06-08 14:51:52 +02:00
Cargo.lock	v0.9.4 (#713)	2023-07-27 19:25:15 +02:00
Cargo.toml	v0.9.4 (#713)	2023-07-27 19:25:15 +02:00
Dockerfile	fix(server): fix missing datasets in quantize	2023-07-27 14:50:45 +02:00
LICENSE	Claim copyright (#7)	2023-08-02 17:23:54 -07:00
Makefile	docs(README): update readme	2023-07-25 19:45:25 +02:00
README-HuggingFace.md	Add a new README (#3)	2023-08-01 12:22:07 -07:00
README.md	Add a new README (#3)	2023-08-01 12:22:07 -07:00
rust-toolchain.toml	v0.9.0 (#525)	2023-07-01 19:25:41 +02:00
sagemaker-entrypoint.sh	feat(sagemaker): add trust remote code to entrypoint (#394)	2023-06-02 09:51:06 +02:00

Text Generation Inference

This is Preemo's fork of text-generation-inference, originally developed by Hugging Face. The original README is at README-HuggingFace.md. Since Hugging Face's text-generation-inference is no longer open-source, we have forked it and will continue to develop it here.

Our goal is to create an open-source text generation inference server that is modular, so that state-of-the-art models, functionalities, and optimizations are easy to add. Functionalities and optimizations should be composable, so that users can easily combine them to create a custom inference server that fits their needs.

Our plan

We at Preemo are currently busy working on the first release of our other product, so we expect to start open-source development on this repository in September 2023. To make external contributions easier, we will be working on the following:

Adding a publicly visible CI/CD pipeline that runs tests and builds Docker images
Unifying the build tools
Modularizing the codebase, introducing a plugin system

Our long-term goal is to grow the community around this repository, as a playground for trying out new ideas and optimizations in LLM inference. We at Preemo will implement features that interest us, but we also welcome contributions from the community, as long as they are modular and composable.