Yang, Bo f93012d59c Merge pull request #4 from michaelfeil/bnb_4bit 4bit quantization with bitsandbytes		2023-09-08 14:52:32 -07:00
.github	update PR template	2023-08-01 18:18:28 +02:00
assets	…
benchmark	Compilation fix: Correct method argument types in generation.rs and validation.rs (#10 )	2023-08-23 13:52:49 -07:00
clients/python	…
docs	v0.9.4 (#713 )	2023-07-27 19:25:15 +02:00
integration-tests	…
launcher	restoring commit from dev branch, rebase on current master	2023-08-01 18:15:18 +02:00
load_tests	…
proto	…
router	Compilation fix: Correct method argument types in generation.rs and validation.rs (#10 )	2023-08-23 13:52:49 -07:00
server	Merge pull request #4 from michaelfeil/bnb_4bit	2023-09-08 14:52:32 -07:00
.dockerignore	…
.gitignore	…
Cargo.lock	v0.9.4 (#713 )	2023-07-27 19:25:15 +02:00
Cargo.toml	v0.9.4 (#713 )	2023-07-27 19:25:15 +02:00
Dockerfile	…
LICENSE	Claim copyright (#7 )	2023-08-02 17:23:54 -07:00
Makefile	…
README-HuggingFace.md	Add a new README (#3 )	2023-08-01 12:22:07 -07:00
README.md	Update README.md	2023-08-03 23:23:02 +02:00
rust-toolchain.toml	…
sagemaker-entrypoint.sh	…

README.md

Text Generation Inference

This is Preemo's fork of text-generation-inference, originally developed by Hugging Face. The original README is at README-HuggingFace.md. Since Hugging Face's text-generation-inference is no longer open-source, we have forked it and will continue to develop it here.

Our goal is to create an open-source text generation inference server that is modular, so that state-of-the-art models, functionalities, and optimizations are easy to add. Functionalities and optimizations should be composable, so that users can easily combine them to create a custom inference server that fits their needs.

Our plan

We at Preemo are currently busy with the first release of our other product, so we expect to start open-source development on this repository in September 2023. To make external contributions easier, we will be working on the following:

Adding a publicly visible CI/CD pipeline that runs tests and builds Docker images
Unifying the build tools
Modularizing the codebase, introducing a plugin system

Our long-term goal is to grow the community around this repository, as a playground for trying out new ideas and optimizations in LLM inference. We at Preemo will implement features that interest us, but we also welcome contributions from the community, as long as they are modularized and composable.

Extra features in comparison to Hugging Face `text-generation-inference` v0.9.4

4-bit quantization

4-bit quantization is available using the NF4 and FP4 data types from bitsandbytes. Enable it by passing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command-line argument to `text-generation-launcher`.
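As a rough illustration of the flag described above, the sketch below assembles a launcher command for one of the two 4-bit modes. The helper name `launcher_cmd` and the model ID `your-org/your-model` are placeholders invented for this example; `--model-id` is assumed to be the launcher's usual way of selecting a model. The script only prints the command, so it can be checked on a machine without a GPU.

```shell
#!/bin/sh
# launcher_cmd is a hypothetical helper for this example, not part of the repository.
# It prints a text-generation-launcher command for the given model and 4-bit mode.
launcher_cmd() {
  model_id="$1"
  quantize="$2"
  case "$quantize" in
    # The two 4-bit modes named in this README.
    bitsandbytes-nf4|bitsandbytes-fp4) ;;
    *)
      echo "not one of the 4-bit modes described here: $quantize" >&2
      return 1
      ;;
  esac
  echo "text-generation-launcher --model-id $model_id --quantize $quantize"
}

# Placeholder model ID (an assumption); substitute the model you want to serve.
launcher_cmd "your-org/your-model" bitsandbytes-nf4
```

Run the printed command in a shell to start the server; swap `bitsandbytes-nf4` for `bitsandbytes-fp4` to use the FP4 data type instead.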
